
Project Still: Private Digital Legacy Stack

0) Motivation#

To my friend who passed away in 2024, and to his family, who want to preserve their connection and hold on to gradually fading memories.

This project does not claim to be him, but it is entirely inspired by him.

This project has many problems; the main goal of this write-up is to reflect on them and identify a path to improve and iterate.

Although this is only a first attempt at representing a real person digitally, I treat it as a long-term project: with a better base model, more data collected and cleaned, and the problems discussed in this article resolved, I eventually aim to build a 95%-fidelity AI digital human.

1) System Overview#

Project Still is a three-modality pipeline (video, audio, text) that drives an anonymized persona representation for real-time WeChat interaction.

In this V1 prototype, I chose real-time face swap and voice conversion as the call-mode front end because they are practical under limited persona data and can achieve convincing real-time results with relatively small datasets.

However, the direction of Project Still in the long term is to transition call-mode from identity transfer (swap/conversion) to generation: a virtual avatar (face generation) plus TTS driven by the text model. That shift depends on improving the chatbot’s coherence (and adding a retrieval-backed memory system), so the “speaker” is the model itself rather than a human operator.

During future iterations of this project, virtual face generation and TTS will be integrated, and real-time face swap and voice conversion will be phased out.

For privacy and safety, this write-up stays at a high level (mainly architecture and lessons learned): no code, datasets, or identifying details are shared.

Architecture#

Figure 1: High-level data flow separating offline training (Debian Linux) from runtime inference (Windows, deployed locally).

Runtime Environment#

  • Training Environment: Debian Linux (for dataset processing and fine-tuning)

  • Deployment System: Windows (for inference and post-training parameter adjustment)

  • Hardware: RTX 4090 + Intel i9-14900K

Deployment modes (VRAM-aware)#

Due to the 24 GB VRAM limit of the RTX 4090 I used, running the visual, audio, and LLM stacks simultaneously is not possible without heavy quantization of the text model. The system is therefore split into two modes.

  • Call Mode (visual + voice): runs face swapping and voice conversion simultaneously during a video call.

    • GPU VRAM usage: ~8.9 GB
  • Chat Mode (text): runs the local chatbot, separately from face swapping and voice conversion

    • GPU VRAM usage: ~19.2 GB (text model runs in FP16)
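The two-mode split can be sketched as a simple VRAM budget check. The per-stack split of the ~8.9 GB call-mode footprint is an assumption for illustration, and `fits` is a hypothetical helper, not part of the actual system:

```python
VRAM_BUDGET_GB = 24.0  # RTX 4090

# Approximate footprints per stack; the face/voice split of the
# measured ~8.9 GB call-mode total is an illustrative assumption.
FOOTPRINT_GB = {
    "face_swap": 4.5,
    "voice_conv": 4.4,
    "text_llm_fp16": 19.2,
}

def fits(stacks):
    """Return True if the given stacks fit in the VRAM budget together."""
    return sum(FOOTPRINT_GB[s] for s in stacks) <= VRAM_BUDGET_GB

assert fits(["face_swap", "voice_conv"])                      # call mode: ~8.9 GB
assert fits(["text_llm_fp16"])                                # chat mode: ~19.2 GB
assert not fits(["face_swap", "voice_conv", "text_llm_fp16"]) # ~28.1 GB > 24 GB
```

This is why the two modes cannot run at once without quantizing the text model.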

Performance#

  • Visual: 25–30 FPS real-time face swap with 3840×2160 webcam input

  • Voice: end-to-end latency is typically ~0.3–0.5 s during real-time conversion

  • Text: message replies are intentionally rate-limited (seconds-level, configurable) due to WeChat platform constraints

Reliability Notes#

  • Longest stable run: ~12 hours of continuous runtime (it can likely run longer under normal conditions)

  • Most common failure points:

    • Visual: face occlusion (hands / objects) can cause face detection to drop

    • Voice: latency is dominated by buffering and audio routing stability when using a virtual sound card

    • Text: in some cases the tone is consistent but the message content becomes incoherent

Design Priorities#

Project Still is optimized mainly for realism > stability > cost. In practice, the project prioritizes high-fidelity output and real-time interaction while keeping the system private and deployed locally only.

2) Privacy & Publishing Constraints#

Project Still is an AI digital legacy system. Although the underlying core components and tools are widely known in the community, the project includes sensitive personal data. For that reason, this write-up intentionally shares only high-level architecture and lessons learned, not a step-by-step reproduction guide.

What this protects against#

  • Leakage of private conversations and personal information
  • Social harm if others copy the system for impersonation

What is not shared#

  • Source Code
  • Training data (text, images, audio) and any derived samples
  • Media demonstrations

What is shared#

  • High-level architecture and design decisions
  • Performance ranges (FPS/latency/VRAM) and reliability notes
  • Failure modes and mitigations
  • Future work and a route to improve

Data Handling#

All raw datasets and model checkpoints are stored locally. All examples in this post are either synthetic or abstracted to avoid exposing private information.

3) Module A - Real-time Face Swap (Rope)#

Note: These are V1 call-mode components and may be phased out in later versions in favor of avatar generation + TTS.

This module covers the pipeline and components for real-time face swap in WeChat video calls via a virtual camera input. The main software used is Rope for real-time face swap and ManyCam for virtual camera input.

3.1) Goal and Constraints + Why Rope#

  • The real camera input is 3840x2160, but the software processes at around 1080p internally to maintain real-time performance.
  • Success Criterion: Stable and convincing at conversational distance, but not cinematic/forensic fidelity.

Rope was chosen for its real-time throughput and because it works with a small photo set (even a single photo). However, higher fidelity requires DFM models trained with DeepFaceLab, which preserve facial gestures and skin-level textures more naturally; this is left as future work due to the data requirements (typically ~3,000 high-quality face photos across different angles).

3.2) Data + pipeline (high-level)#

  • Data:

    • 15 candid photos, curated for pose/angle coverage rather than volume
    • Preprocessing: Face alignment
  • Pipeline:

4K webcam input → internal downscale (~1080p) → detect/align → swap/blend → output as virtual camera → WeChat call.
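The internal downscale step amounts to computing an aspect-preserving ~1080p working resolution from the 4K feed. The helper below is an illustrative sketch, not Rope's actual code:

```python
def working_resolution(src_w, src_h, target_h=1080):
    """Scale (src_w, src_h) so height == target_h, preserving aspect ratio."""
    scale = target_h / src_h
    return round(src_w * scale), target_h

# 4K camera feed → ~1080p internal working resolution
print(working_resolution(3840, 2160))  # (1920, 1080)
```

Detection, alignment, and blending then all run at this reduced resolution, which is what makes 25–30 FPS feasible.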

3.3) Results (performance + reliability)#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Webcam input | 3840×2160 | Real camera feed |
| Internal processing | ~1080p | Downscaled for real-time |
| Face-swap FPS | ~25–30 FPS | Scene / occlusion / motion dependent |
| Call-mode GPU VRAM | ~8.9 GB | Face + voice together |
| Longest stable run | 12+ hours | Tested continuous runtime |

3.4) Failure modes + mitigations (the main content)#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Face occlusion | Swap drops or misaligns if hands/objects cover the face | Detection confidence collapses | Keep the face unobstructed; stable framing while avoiding hand-to-face gestures |
| Lighting / texture loss ("beauty filter" effect) | Skin looks overly smooth; freckles/pores disappear; realism drops | Real-time blending + downscaling reduces high-frequency detail; low light amplifies noise/blur and worsens texture fidelity | Improve lighting (avoid low light; add soft, angled light); keep exposure stable; prefer sharp, well-lit reference photos |
| Fast motion | Temporal jitter and unstable alignment during quick head turns | Motion blur + rapid pose change breaks tracking | Avoid fast head turns; keep the camera fixed; minimize motion blur |
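For the occlusion failure, one common mitigation is to gate on detection confidence and briefly hold the last good frame instead of showing a misaligned swap. This is a hypothetical sketch of that idea, not Rope's internals:

```python
class OcclusionGate:
    """Hold the last confidently-swapped frame during brief occlusions."""

    def __init__(self, threshold=0.5, max_hold_frames=15):
        self.threshold = threshold       # min detection confidence to trust a swap
        self.max_hold = max_hold_frames  # how many frames to reuse the last result
        self.last_good = None
        self.held = 0

    def step(self, frame, confidence):
        """Return the frame to display given this frame's detection confidence."""
        if confidence >= self.threshold:
            self.last_good, self.held = frame, 0
            return frame
        if self.last_good is not None and self.held < self.max_hold:
            self.held += 1
            return self.last_good        # hold last good result through occlusion
        return frame                     # give up: pass the raw frame through

gate = OcclusionGate()
assert gate.step("f1", 0.9) == "f1"  # confident: pass through
assert gate.step("f2", 0.2) == "f1"  # occluded: hold last good frame
```

The tradeoff is a short freeze instead of a visible identity break, which is usually less jarring on a call.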

3.5) Takeaway#

In practice, most real-world failures are driven by detection/alignment and lighting rather than model "intelligence." However, model fidelity can be further improved once a DFM model is trained.

4) Module B — Real-time Voice Conversion (RVC + W-Okada)#

Note: These are V1 call-mode components and may be phased out in later versions in favor of avatar generation + TTS.

4.1) Goal#

This voice module converts live microphone speech into the target voice in real time for WeChat calls. I used RVC for voice model training and W-Okada for real-time inference and audio routing. In the RVC software, pitch extraction uses RMVPE at 48 kHz.

4.2) Dataset + preprocessing & deployment pipeline#

  • Data

    • Dataset: 10 minutes of single-speaker Chinese speech from a formal oral assessment recording.
    • Preprocessing: a built-in multi-stage pipeline in the RVC toolchain, including vocal cleanup/denoising and automatic slicing into short segments for training efficiency.
  • Pipeline

Mic input → real-time RVC inference (W-Okada) → virtual audio device routing → WeChat call audio input.
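Since the end-to-end figure is dominated by buffering, a back-of-envelope model helps reason about the audio path. The chunk size and buffer count below are illustrative assumptions, not measured values from this system:

```python
SAMPLE_RATE = 48_000  # Hz, the routing baseline used in this project

def buffer_latency_s(chunk_samples, n_buffers):
    """Seconds of latency contributed by n_buffers chunks of buffered audio."""
    return chunk_samples * n_buffers / SAMPLE_RATE

# e.g. three 4096-sample buffers across the capture/convert/output chain:
print(round(buffer_latency_s(4096, 3), 3))  # 0.256
```

With inference time and virtual-device routing overhead added on top, a buffering budget like this lands in the observed ~0.3–0.5 s range, which is why shrinking buffers is the main latency lever (at the cost of stability).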

4.3) Results (latency + stability)#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Sample rate | 48 kHz | Real-time routing baseline |
| f0 method | RMVPE | Pitch extraction for conversion |
| Voice latency (end-to-end) | ~0.3–0.5 s | Dominated by buffering + routing + inference |
| Stability | 12+ hours (call mode) | Most sensitive to audio device configuration |

4.4) Failure modes + mitigations#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Emotion / prosody mismatch | Timbre is convincing, but emotion/intonation feels "too formal" | Training data is mostly formal speech with limited expressive range | Expand training audio to include more emotional and conversational speech (future work); keep expectations realistic for current data |
| Routing instability (configuration sensitivity) | Occasional distortion or inconsistent input/output behavior | Virtual audio routing depends on a consistent device format and stable system settings | Standardize audio device settings (e.g., mono / 16-bit / 48 kHz) as a reliability baseline |

4.5) Takeaway#

The system’s perceived interaction latency is dominated by the voice conversion pipeline (~0.3–0.5 s end-to-end). In terms of quality, timbre similarity is achievable with clean data, but expressive realism (especially emotion when speaking) is primarily limited by the diversity of the training speech domain.

5) Module C — WeChat Persona Chatbot (GLM-4-9B + LoRA)#

This module implements a private persona chatbot for WeChat group chats. It fine-tunes GLM-4-9B-Chat with LoRA on prior WeChat group-chat history, then runs local inference on Windows and posts replies via GUI automation.

5.1) Goal + tooling choice#

  • Goal: generate persona-consistent replies in group chats; fine-tuning significantly reduced the default assistant-like tone, and the resulting style is much closer to the target’s voice.
  • Model: GLM-4-9B-Chat, fine-tuned with LoRA SFT and merged into a deployable checkpoint.
  • Constraint: chatbot runs in a separate mode due to GPU memory footprint (FP16 inference).

5.2) Dataset + formatting (high-level, anonymized)#

  • Training format: Alpaca-style JSONL with instruction / input / output.
  • Input: a fixed-size recent multi-speaker window (same structure as deployment).
  • Speaker isolation: nickname mapping → placeholders (e.g., <TARGET>, <U1>, <U2>) to reduce privacy leakage and “voice mixing.”
  • Scale: 34,677 cleaned training examples (from a subset of group chats).

Synthetic schema example for illustration:

```json
{
  "instruction": "You are <TARGET>. Reply in <TARGET>'s tone and style.",
  "input": "<U1>: ...\n<TARGET>: ...\n<U2>: ...",
  "output": "..."
}
```
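The speaker-isolation step can be sketched as a small formatting function that maps real nicknames to placeholders and emits an Alpaca-style sample of the shape above. `build_sample` and its details are illustrative, not the actual tooling:

```python
def build_sample(window, target_name):
    """window: list of (nickname, text) turns; the last turn is the target's reply."""
    mapping, next_id = {target_name: "<TARGET>"}, 1

    def anon(name):
        # Assign <U1>, <U2>, ... in order of first appearance.
        nonlocal next_id
        if name not in mapping:
            mapping[name] = f"<U{next_id}>"
            next_id += 1
        return mapping[name]

    *context, (last_name, last_text) = window
    assert last_name == target_name, "last turn must be the target's reply"
    return {
        "instruction": "You are <TARGET>. Reply in <TARGET>'s tone and style.",
        "input": "\n".join(f"{anon(n)}: {t}" for n, t in context),
        "output": last_text,
    }

sample = build_sample([("alice", "hi"), ("bob", "yo"), ("me", "hey")], "me")
assert sample["input"] == "<U1>: hi\n<U2>: yo"
assert sample["output"] == "hey"
```

Mapping nicknames per-sample (rather than globally) also limits how much identity information a single training example carries.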

5.3) Fine-tuning approach#

  • Supervised fine-tuning with LoRA on Debian Linux.
  • Deployment uses a merged checkpoint for simpler inference.
  • The main quality driver is dataset coverage and structure (speaker isolation + consistent context formatting), not just model size.
  • Each training sample supervises a single next reply given a fixed-size multi-speaker window, which is effective for learning tone but can be brittle for long-range consistency.

5.4) Deployment policy (WeChat integration)#

  • Scope: runs only in whitelisted group chat(s) (not global auto-reply).
  • Context window: replies are generated from the last 12 messages (multi-speaker).
  • Debounce + jitter: the bot waits briefly and introduces randomized delays to avoid replying mid-burst and to reduce spam-like behavior.
  • Cold-silence follow-up (capped):
    • If the chat has been quiet for 8 seconds and the last message was from the bot/self, it may send a follow-up.
    • Follow-ups are capped at 4 consecutive self-talk attempts to prevent runaway behavior.
  • Decoding defaults: top_p=0.85, temperature=1.15, max_new_tokens=128, repetition_penalty=1.1.
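The reply policy above can be sketched as two small helpers: a debounce delay with jitter, and the capped cold-silence follow-up check. The timing constants mirror the stated policy, while the function names and structure are hypothetical:

```python
import random

COLD_SILENCE_S = 8   # quiet period before a follow-up is allowed
MAX_FOLLOW_UPS = 4   # cap on consecutive self-talk attempts

def reply_delay_s(base=2.0, jitter=3.0):
    """Debounce: wait a base period plus random jitter before replying."""
    return base + random.uniform(0, jitter)

def may_follow_up(quiet_for_s, last_sender_is_self, follow_ups_sent):
    """Allow a follow-up only after cold silence, and only under the cap."""
    return (quiet_for_s >= COLD_SILENCE_S
            and last_sender_is_self
            and follow_ups_sent < MAX_FOLLOW_UPS)

assert may_follow_up(9.0, True, 0)
assert not may_follow_up(9.0, True, 4)   # cap reached
assert not may_follow_up(5.0, True, 0)   # not quiet long enough
```

The cap is the important part: without it, the cold-silence rule can re-trigger on the bot's own messages and spiral into self-talk.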

5.5) Results#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Training examples | 34,677 | Cleaned Alpaca-style JSONL |
| Context strategy | Last 12 messages | Fixed-size recent window |
| Speaker isolation | Placeholders | Nickname mapping → <TARGET>, <U1>, … |
| Tone fidelity | Strong (subjective) | Style is consistent; limited drift observed |
| Chat-mode precision | FP16 | Dominates VRAM budget |
| Chat-mode GPU VRAM | ~19.2 GB | Runs separately from call mode |
| Stability | 12+ hours | Continuous runtime tested |
| Reply policy | Debounce + jitter + caps | Whitelisted groups only |

5.6) Failure modes + mitigations#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Tone-right / content-wrong | Style matches, but meaning becomes incoherent (rare) | Limited coverage from a subset of chats; weak grounding | Expand dataset breadth; filter low-signal examples; add grounding later (future work: RAG) |
| Data-structure brittleness (context → single reply) | Good local replies, but shallow continuity across longer arcs | Supervision targets a single-turn next reply; long-range memory isn't represented | Improve data organization (multi-turn targets, longer targets, topic continuity); future: add a retrieval "memory system" |
| GUI automation fragility | Occasional missed or duplicated send | UI state is not a stable API | Add health checks; manual kill switch; periodic re-sync of chat state |

5.7) Takeaway#

For persona chatbots, the bottleneck in this project is data structure + speaker isolation + deployment policy, not just model size (although more parameters might help). The "voice" (tone/style) is already strong; the current ceiling is coverage and grounding. The most direct next steps are improving dataset breadth and organization, and adding retrieval grounding (RAG) as a dedicated memory system to keep replies coherent beyond the context window.

6) Evaluation (Fidelity + Latency + Reliability)#

Because this project is private, evaluation is intentionally artifact-free and subjective; it serves as a personal yardstick rather than a public benchmark. I evaluate the system along two axes:

  1. Systems metrics (FPS, latency, VRAM, stability), and
  2. Human realism rubric (visual / voice / text) assessed qualitatively during real use.

6.1) Systems metrics#

| Stream | Metric | Observed range | Notes |
| --- | --- | --- | --- |
| Visual (Call Mode) | Input resolution | 3840×2160 | Camera feed |
| Visual (Call Mode) | Internal processing | ~1080p | Downscaled for real-time |
| Visual (Call Mode) | FPS | ~25–30 FPS | Scene / occlusion / motion dependent |
| Voice (Call Mode) | End-to-end latency | ~0.3–0.5 s | Buffer/routing + inference dominated |
| Call Mode | GPU VRAM | ~8.9 GB | Face + voice together |
| Chat Mode | GPU VRAM | ~19.2 GB | FP16 local inference |
| Chat Mode | Context window | Last 12 messages | Fixed-size recent window |
| System | Stability | 12+ hours | Continuous runtime tested |

Key observation: perceived interaction latency is dominated by the voice conversion pipeline, not the video streaming.

6.2) Human realism rubric (qualitative)#

I score realism with a simple rubric that separates “sounds/looks right” from “behaves coherently over time”.

| Component | Dimension | Rating | What "good" means in practice |
| --- | --- | --- | --- |
| Visual | Stability | High | Minimal jitter; consistent alignment during normal motion |
| Visual | Texture realism | Medium | Skin-level details (pores/freckles) are the hardest to preserve |
| Visual | Robustness | Medium | Fails under occlusion, fast motion, or low light |
| Voice | Timbre similarity | High | Voice identity is convincing in typical speech |
| Voice | Prosody/emotion | Medium | Training speech domain is formal; emotional range is limited |
| Voice | Latency | Medium–Low | ~0.3–0.5 s is noticeable in conversation |
| Text | Tone/style fidelity | High | Fine-tuning reduces assistant-like tone; style is consistent |
| Text | Coherence | Medium | Usually coherent, with occasional "tone-right/content-wrong" cases |
| Text | Long-range consistency | Medium | Fixed-window supervision is weaker for long memory continuity |

(Ratings are subjective, not a formal user study.)

6.3) What worked vs. what bottlenecked#

What worked best

  • Tone/style transfer (text): LoRA fine-tuning effectively reduced the assistant-like tone and restored the target's original tone of speech.
  • Timbre conversion (voice): convincing identity under clean input and stable routing.
  • Real-time feasibility (visual): internal downscaling enabled stable FPS in live calls.

Primary bottlenecks

  • Visual realism ceiling: low light + real-time blending reduce skin-level texture fidelity.
  • Prosody mismatch: training audio is formal, limiting expressive speech.
  • Text grounding: multi-speaker context → single-reply supervision can drift semantically without a memory/grounding (RAG) layer.

6.4) Limitations (what this evaluation cannot claim)#

  • This is not a controlled user study, and there are no public artifacts to verify outputs.
  • Data coverage is limited (photo/audio/chat subset), so performance is uneven across edge cases.
  • The rubric prioritizes practical “call realism” over forensic realism, and results should be interpreted in that context.

7) Future Work and Roadmap#

This project is a working prototype, not a finished “product”. The next phase is about improving fidelity, robustness, and grounded consistency.

7.1) Roadmap (high-level)#

| Area | Upgrade | Expected impact | Effort | Risk / tradeoff |
| --- | --- | --- | --- | --- |
| Text | Broaden dataset coverage (more chats, more topics, more time range) | Fewer incoherent cases; better opinion/knowledge coverage | Medium–High | High (third-party privacy + cleaning complexity) |
| Text | Improve data organization beyond "context → single reply" | Better multi-turn continuity; less brittle long-arc behavior | Medium | Medium (formatting choices can add noise) |
| Text | RAG memory system (vector search over chat history) | Grounded responses; better long-range consistency beyond context window | Medium | Medium (must enforce strict retrieval filtering/redaction) |
| Voice | Expand training audio beyond formal speech | Better prosody/emotion; more natural casual speech (improves RVC quality) | Medium | Medium (data cleanliness + domain balance) |
| Voice | Retrain RVC for prosody + stability | Stronger "identity voice layer"; fewer artifacts when driven by TTS later | Medium–High | Medium (quality depends on data diversity) |
| Voice | Lower-latency audio path (reduce buffering where possible) | Faster turn-taking; improved "presence" | Medium | Medium (stability vs latency tradeoff) |
| System | Spoken agent mode: LLM(+RAG) → TTS → RVC → WeChat | Agent can speak autonomously while keeping target voice identity | Medium–High | High (text coherence gating + additional latency) |
| Visual | Avatar generation for calls (audio-driven talking head + lip-sync) | Phases out face swap; improves controllability in calls | High | High (temporal stability + latency constraints) |
| System | Full avatar-call mode: spoken agent + avatar renderer | Coherent "digital presence" across text/voice/video | High | High (end-to-end stability + governance) |
| Visual (V1 fallback) | DFM / DeepFaceLab training | Higher texture fidelity if face swap remains needed | High | High (requires thousands of high-quality images) |

7.2) What I would build next (concrete milestones)#

Milestone A — “Memory-first chatbot” (Text)

  • Export a broader message history.
  • Build a retrieval layer (RAG) that:
    • indexes only approved content,
    • retrieves with strict filtering (speaker/PII-aware),
    • feeds short, relevant snippets into the fixed-window context.
  • Re-evaluate coherence and long-range consistency.
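The retrieval layer's filtering step might look like the following minimal sketch: index only approved snippets, drop anything matching simple PII patterns, and rank by overlap with the current context. A real system would use embeddings; keyword overlap, the patterns, and all names here are illustrative stand-ins:

```python
import re

# Illustrative PII patterns; a real redaction pass would be far stricter.
PII_PATTERNS = [re.compile(r"\d{11}"),        # e.g. phone numbers
                re.compile(r"\S+@\S+\.\S+")]  # e.g. email addresses

def safe(snippet):
    """A snippet is indexable only if no PII pattern matches it."""
    return not any(p.search(snippet) for p in PII_PATTERNS)

def retrieve(query, approved_snippets, k=2):
    """Rank approved, PII-free snippets by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().split())), s)
              for s in approved_snippets if safe(s)]
    scored.sort(key=lambda x: -x[0])
    return [s for score, s in scored[:k] if score > 0]

docs = ["we talked about hiking last spring",
        "call me at 13800000000",          # filtered out: looks like PII
        "hiking in the rain was fun"]
print(retrieve("do you remember hiking", docs))
# ['we talked about hiking last spring', 'hiking in the rain was fun']
```

The key design point is that filtering happens at index time and retrieval time, so nothing unapproved or PII-bearing can reach the fixed-window context.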

Milestone B — “Identity Voice Layer” (RVC now, TTS later)

  • Collect more diverse speech domains (casual conversation, different emotions) to expand prosody coverage beyond formal speech.
  • Retrain RVC to improve prosody transfer and reduce artifacts (timbre is already strong; emotion is the ceiling).
  • Optimize the real-time path by experimenting with latency/stability tradeoffs (buffer size, routing configuration), while keeping call reliability acceptable.
  • Transition plan (gated): once the text agent is sufficiently coherent and grounded, replace the human mic input with TTS, and run:
    • LLM(+RAG) → TTS → RVC (identity voice) → virtual mic → WeChat call

  This keeps the identity voice while allowing the agent to speak autonomously.

Milestone C — “Higher-fidelity face” (Visual)

  • If enough high-quality images can be obtained, train a DFM model for better texture and lighting robustness.
  • Replace face-swap with generation-based streaming:
    • avatar/face generation (and lip-sync) driven by the spoken agent.
  • This phases out face swap and voice conversion as the default call-mode path.

7.3) Guiding principle#

Future upgrades should preserve the project’s core constraints: private by design, local-only, and publishing-safe. For this kind of system, the hardest problems are not just model capability—they’re data governance and responsible deployment.

8) Conclusion#

Project Still is an end-to-end digital-human prototype built across three streams—real-time face swap, real-time voice conversion, and a persona chatbot—integrated into WeChat for real interaction.

Over time, this pipeline will be iterated toward avatar generation, TTS feeding RVC, and the persona chatbot as the speaker.

From an engineering perspective, the biggest lesson is that “high fidelity” is less about any single model and more about the weakest link in the pipeline:

  • Visual: real-time performance requires downscaling, and realism is often limited by lighting, occlusion, and skin-texture fidelity.
  • Voice: timbre conversion can be convincing, but perceived responsiveness is dominated by buffering/routing latency, and expressive emotion and prosody depend heavily on training data diversity.
  • Text: fine-tuning is effective for tone, but coherence and long-range consistency are ultimately gated by coverage and grounding (not just model parameters).

Going forward, the roadmap is clear: expand data coverage, improve dataset organization beyond "context → single reply," and add a retrieval-backed memory system (RAG) to keep responses grounded beyond the context window. On the visual side, avatar generation can replace the current face-swap mechanism. On the audio side, more diverse audio datasets can raise the RVC model's prosody diversity, with TTS then providing the audio output.

Project Still is private and will remain so. This post shares only architecture, metrics, and lessons learned, as a baseline for me to iterate and improve. If it also helps other builders learn from the tradeoffs, I am grateful.

May this project bring some light to the world.

And I can always feel my friend is still with me.

Still, always.

https://www.t1sun.com/blog/project_still
Author: t1sun
Published: January 19, 2025