pvv-chan

A take on creating a chatbot based on custom data input.


Adrian Gunnar Lauterer 418d6d044d init		2024-05-27 18:42:18 +02:00
.gitignore	init	2024-05-27 18:42:18 +02:00
assistant.py	init	2024-05-27 18:42:18 +02:00
flake.lock	init	2024-05-27 18:42:18 +02:00
flake.nix	init	2024-05-27 18:42:18 +02:00
image.py	init	2024-05-27 18:42:18 +02:00
llm.py	init	2024-05-27 18:42:18 +02:00
README.md	init	2024-05-27 18:42:18 +02:00
stt.py	init	2024-05-27 18:42:18 +02:00
tts.py	init	2024-05-27 18:42:18 +02:00

README.md

This is a simple chatbot project. The aim is to recreate something similar to Neuro-sama, running on local hardware with a minimal amount of compute.

The bot is designed to be modular, with the ability to add new modules easily.

You need to supply a MediaWiki XML backup. It is used as the source of information for the chatbot.
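As a rough illustration of how such a backup could be read, the sketch below extracts page titles and wikitext using only the Python standard library. The function name `parse_pages` is hypothetical and not taken from this repository, and the sketch assumes the usual MediaWiki export layout (`page` elements containing `title` and `revision/text`).

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace, which differs between export versions."""
    return tag.rsplit("}", 1)[-1]


def parse_pages(xml_bytes: bytes) -> list[tuple[str, str]]:
    """Return (title, wikitext) pairs from a MediaWiki XML export.

    Hypothetical helper: if a page carries several revisions, the text of
    the last one in document order is kept.
    """
    root = ET.fromstring(xml_bytes)
    pages = []
    for elem in root.iter():
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        pages.append((title, text))
    return pages
```

For a large dump, `ET.iterparse` would avoid loading the whole file into memory; the version above favours brevity.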

A powerful computer with CUDA and a fair amount of VRAM is advised to keep response times down.

Most settings are configured through environment variables set in the flake.nix file.
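A minimal sketch of reading such settings on the Python side is shown below. The variable names (`PVV_WIKI_XML`, `PVV_LLM_MODEL`, `PVV_TTS_MODEL`) and their defaults are made up for illustration; the names actually exported by flake.nix may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    wiki_xml_path: str
    llm_model: str
    tts_model_path: str


def load_settings(env=None) -> Settings:
    """Read settings from environment variables, with fallbacks.

    All variable names and default values here are hypothetical.
    """
    env = os.environ if env is None else env
    return Settings(
        wiki_xml_path=env.get("PVV_WIKI_XML", "wiki.xml"),
        llm_model=env.get("PVV_LLM_MODEL", "llama3"),
        tts_model_path=env.get("PVV_TTS_MODEL", "voice.onnx"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise without touching the real environment.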

Modules

stt

The stt module is responsible for converting speech to text. whisper-cpp-stream is used to stream audio through the Whisper STT engine. It is a C++ program that reads audio from a microphone and sends it to the Whisper engine. It is run as a Python subprocess.
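The subprocess approach described above can be sketched roughly as follows: start the external program and yield its standard output line by line. The helper name `stream_lines` is hypothetical, and the real whisper-cpp-stream output may contain terminal control sequences or markers that need extra cleanup, which this sketch does not attempt.

```python
import subprocess
from typing import Iterator, Sequence


def stream_lines(command: Sequence[str]) -> Iterator[str]:
    """Run a command and yield each non-empty line it prints to stdout."""
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered, so text arrives as it is produced
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield line
```

A call might look like `stream_lines(["whisper-cpp-stream", "-m", "path/to/model.bin"])`, where the model path is a placeholder.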

llm

The llm module is responsible for crafting a response to the user's input. It uses RAG (retrieval-augmented generation) based on a supplied MediaWiki XML file and, in the future, the chat history as well.

LangChain is the Python module that interfaces with the RAG pipeline and the LLM. Ollama is used on the backend to serve a Llama model.
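To make the retrieval idea concrete, here is a toy sketch that does not use LangChain or Ollama at all. It swaps the embedding-based lookup a real RAG setup would typically use for plain word overlap, and only shows the shape of the flow: pick the most relevant text chunks, then place them in the prompt. All function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case word set; a crude stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    # Drop documents that share no words with the question at all.
    return [d for d in ranked[:k] if q & tokenize(d)]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a prompt that asks the model to answer from the context."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting prompt string is what would be handed to the model; in this project that step goes through LangChain and Ollama.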

Future work will include giving a structured response that includes emotions and metadata for a future image module.

tts

Piper is used as the TTS engine. It does not have proper Python bindings in nixpkgs, so it is run with subprocess. Text is echoed into Piper's stdin, and the output is played with aplay.
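The pipeline described above could be wired up roughly as in the sketch below. It writes the text straight to Piper's stdin instead of going through a shell `echo`, which avoids quoting problems. The helper names are hypothetical, the model path is a placeholder, and the 22050 Hz sample rate is an assumption that depends on the chosen voice.

```python
import subprocess


def build_tts_commands(model_path: str, sample_rate: int = 22050):
    """Return the (piper, aplay) argument lists for a raw-audio pipeline."""
    piper_cmd = ["piper", "--model", model_path, "--output_raw"]
    aplay_cmd = ["aplay", "-r", str(sample_rate), "-f", "S16_LE", "-t", "raw", "-"]
    return piper_cmd, aplay_cmd


def speak(text: str, model_path: str, sample_rate: int = 22050) -> None:
    """Pipe text into piper and piper's raw audio into aplay."""
    piper_cmd, aplay_cmd = build_tts_commands(model_path, sample_rate)
    piper = subprocess.Popen(piper_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    aplay = subprocess.Popen(aplay_cmd, stdin=piper.stdout)
    piper.stdout.close()  # aplay now owns the read end of the pipe
    piper.stdin.write(text.encode("utf-8") + b"\n")
    piper.stdin.close()  # signals end of input to piper
    aplay.wait()
    piper.wait()
```

Calling `speak("Hello there", "path/to/voice.onnx")` requires both `piper` and `aplay` on the PATH.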

image

The image module is responsible for processing images. It captures an image using pygame, Base64-encodes it, and sends it to a multimodal model to get a description. Future work is to test OpenCV or something similar for image tagging instead, as the multimodal model hallucinates a lot and is also far too slow.
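The encode-and-send step might look something like the sketch below. It assumes the multimodal model is served by a local Ollama instance on its default port and reached through the `/api/generate` endpoint, which this README does not confirm. The model name `llava`, the prompt, and the function names are placeholders. Capturing the image with pygame is left out; the functions take the already encoded image file as bytes.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, model: str = "llava",
                         prompt: str = "Describe this image briefly.") -> dict:
    """Build a JSON-serialisable request body with the image as Base64."""
    return {
        "model": model,  # placeholder model name
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON reply
    }


def describe_image(image_bytes: bytes,
                   url: str = "http://localhost:11434/api/generate") -> str:
    """Send the request to a local Ollama server and return its reply text."""
    body = json.dumps(build_vision_request(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```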