
Project Still: Private Digital Legacy Stack

0) Motivation#

To my friend who passed away in 2024, and to his family, who want to preserve their connection and hold on to gradually fading memories.

This project does not claim to be him, but it is entirely inspired by him.

This project has many problems; the main goal of this write-up is to reflect on them and identify a path to improve and iterate.

Although this is only a first attempt at representing a real person digitally, I treat it as a long-term project: with a better base model, more data collected and cleaned, and the problems discussed in this article resolved, I eventually aim to build a 95%-fidelity AI digital human.

1) System Overview#

Project Still is a three-modality pipeline (video, audio, text) that drives an anonymized persona representation for real-time WeChat interaction.

In this V1 prototype, I chose real-time face swap and voice conversion as the call-mode front end because they are practical under limited persona data and can achieve convincing real-time results with relatively small datasets.

However, the direction of Project Still in the long term is to transition call-mode from identity transfer (swap/conversion) to generation: a virtual avatar (face generation) plus TTS driven by the text model. That shift depends on improving the chatbot’s coherence (and adding a retrieval-backed memory system), so the “speaker” is the model itself rather than a human operator.

During future iterations of this project, virtual face generation and TTS will be integrated, and real-time face swap and voice conversion will be phased out.

For privacy and safety, this write-up stays at a high level (mainly architecture and lessons learned): no code, datasets, or identifying details are shared.

Architecture#

Figure 1: High-level data flow separating offline training (Debian Linux) from runtime inference (Windows, deployed locally).

Runtime Environment#

  • Training Environment: Debian Linux (for dataset processing and fine-tuning)

  • Deployment System: Windows (for inference and post-training parameter adjustment)

  • Hardware: RTX 4090 + Intel i9-14900K

Deployment modes (VRAM-aware)#

Due to the 24 GB VRAM limit of the RTX 4090 I used, running the visual, audio, and LLM stacks simultaneously is not possible without heavy quantization of the text model. The system is therefore split into two modes.

  • Call Mode (visual + voice): runs face swapping and voice conversion simultaneously during a video call.

    • GPU VRAM usage: ~8.9 GB
  • Chat Mode (text): runs the local chatbot, separately from face swapping and voice conversion

    • GPU VRAM usage: ~19.2 GB (text model runs in FP16)
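The two-mode split can be sketched as a simple VRAM budget check. The per-stack split of the ~8.9 GB call-mode footprint is an assumption for illustration, and `fits` is a hypothetical helper, not part of the actual system:

```python
VRAM_BUDGET_GB = 24.0  # RTX 4090

# Approximate footprints per stack; the face/voice split of the
# measured ~8.9 GB call-mode total is an illustrative assumption.
FOOTPRINT_GB = {
    "face_swap": 4.5,
    "voice_conv": 4.4,
    "text_llm_fp16": 19.2,
}

def fits(stacks):
    """Return True if the given stacks fit in the VRAM budget together."""
    return sum(FOOTPRINT_GB[s] for s in stacks) <= VRAM_BUDGET_GB

assert fits(["face_swap", "voice_conv"])                      # call mode: ~8.9 GB
assert fits(["text_llm_fp16"])                                # chat mode: ~19.2 GB
assert not fits(["face_swap", "voice_conv", "text_llm_fp16"]) # ~28.1 GB > 24 GB
```

This is why the two modes cannot run at once without quantizing the text model.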

Performance#

  • Visual: 25–30 FPS real-time face swap with 3840×2160 webcam input

  • Voice: end-to-end latency is typically ~0.3–0.5 s during real-time conversion

  • Text: message replies are intentionally rate-limited (seconds-level, configurable) due to WeChat platform constraints

Reliability Notes#

  • Longest stable run: ~12 hours of continuous runtime (it can likely run longer under normal conditions)

  • Most common failure points:

    • Visual: face occlusion (hands / objects) can cause face detection to drop

    • Voice: latency is dominated by buffering and audio routing stability when using a virtual sound card

    • Text: in some cases the tone is consistent but the message content becomes incoherent

Design Priorities#

Project Still is optimized mainly for realism > stability > cost. In practice, the project prioritizes high-fidelity output and real-time interaction while keeping the system private and deployed locally only.

2) Privacy & Publishing Constraints#

Project Still is an AI digital legacy system. Although the underlying core components and tools are widely known in the community, the project includes sensitive personal data. For that reason, this write-up intentionally shares only high-level architecture and lessons learned, not a step-by-step reproduction guide.

What this protects against#

  • Leakage of private conversations and personal information
  • Social harm if others copy the system for impersonation

What is not shared#

  • Source Code
  • Training data (text, images, audio) and any derived samples
  • Media demonstrations

What is shared#

  • High-level architecture and design decisions
  • Performance ranges (FPS/latency/VRAM) and reliability notes
  • Failure modes and mitigations
  • Future work and a route to improve

Data Handling#

All raw datasets and model checkpoints are stored locally. All examples in this post are either synthetic or abstracted to avoid exposing private information.

3) Module A - Real-time Face Swap (Rope)#

Note: These are V1 call-mode components and may be phased out in later versions in favor of avatar generation + TTS.

This module covers the pipeline and components for real-time face swap in WeChat video calls via a virtual camera input. The main software used is Rope for real-time face swap and ManyCam for virtual camera input.

3.1) Goal and Constraints + Why Rope#

  • The real camera input is 3840x2160, but the software processes at around 1080p internally to maintain real-time performance.
  • Success Criterion: Stable and convincing at conversational distance, but not cinematic/forensic fidelity.

Rope was chosen for its real-time throughput and because it works with a small photo set (even a single photo). However, higher fidelity requires DFM models trained with DeepFaceLab, which preserve facial gestures and skin-level textures more naturally; this is left as future work due to the data requirements (typically ~3,000 high-quality face photos across different angles).

3.2) Data + pipeline (high-level)#

  • Data:

    • 15 candid photos, curated for pose/angle coverage rather than volume
    • Preprocessing: Face alignment
  • Pipeline:

4K webcam input → internal downscale (~1080p) → detect/align → swap/blend → output as virtual camera → WeChat call.
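The internal downscale step amounts to computing an aspect-preserving ~1080p working resolution from the 4K feed. The helper below is an illustrative sketch, not Rope's actual code:

```python
def working_resolution(src_w, src_h, target_h=1080):
    """Scale (src_w, src_h) so height == target_h, preserving aspect ratio."""
    scale = target_h / src_h
    return round(src_w * scale), target_h

# 4K camera feed → ~1080p internal working resolution
print(working_resolution(3840, 2160))  # (1920, 1080)
```

Detection, alignment, and blending then all run at this reduced resolution, which is what makes 25–30 FPS feasible.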

3.3) Results (performance + reliability)#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Webcam input | 3840×2160 | Real camera feed |
| Internal processing | ~1080p | Downscaled for real-time |
| Face-swap FPS | ~25–30 FPS | Scene / occlusion / motion dependent |
| Call-mode GPU VRAM | ~8.9 GB | Face + voice together |
| Longest stable run | 12+ hours | Tested continuous runtime |

3.4) Failure modes + mitigations (the main content)#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Face occlusion | Swap drops or misaligns if hands/objects cover the face | Detection confidence collapses | Keep the face unobstructed; stable framing while avoiding hand-to-face gestures |
| Lighting / texture loss ("beauty filter" effect) | Skin looks overly smooth; freckles/pores disappear; realism drops | Real-time blending + downscaling reduces high-frequency detail; low light amplifies noise/blur and worsens texture fidelity | Improve lighting (avoid low light; add soft, angled light); keep exposure stable; prefer sharp, well-lit reference photos |
| Fast motion | Temporal jitter and unstable alignment during quick head turns | Motion blur + rapid pose change breaks tracking | Avoid fast head turns; keep the camera fixed; minimize motion blur |
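For the occlusion failure, one common mitigation is to gate on detection confidence and briefly hold the last good frame instead of showing a misaligned swap. This is a hypothetical sketch of that idea, not Rope's internals:

```python
class OcclusionGate:
    """Hold the last confidently-swapped frame during brief occlusions."""

    def __init__(self, threshold=0.5, max_hold_frames=15):
        self.threshold = threshold       # min detection confidence to trust a swap
        self.max_hold = max_hold_frames  # how many frames to reuse the last result
        self.last_good = None
        self.held = 0

    def step(self, frame, confidence):
        """Return the frame to display given this frame's detection confidence."""
        if confidence >= self.threshold:
            self.last_good, self.held = frame, 0
            return frame
        if self.last_good is not None and self.held < self.max_hold:
            self.held += 1
            return self.last_good        # hold last good result through occlusion
        return frame                     # give up: pass the raw frame through

gate = OcclusionGate()
assert gate.step("f1", 0.9) == "f1"  # confident: pass through
assert gate.step("f2", 0.2) == "f1"  # occluded: hold last good frame
```

The tradeoff is a short freeze instead of a visible identity break, which is usually less jarring on a call.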

3.5) Takeaway#

In practice, most real-world failures are driven by detection/alignment and lighting rather than model "intelligence." However, model fidelity can be further improved once a DFM model is trained.

4) Module B — Real-time Voice Conversion (RVC + W-Okada)#

Note: These are V1 call-mode components and may be phased out in later versions in favor of avatar generation + TTS.

4.1) Goal#

This voice module converts live microphone speech into the target voice in real time for WeChat calls. I used RVC for voice model training and W-Okada for real-time inference and audio routing. In the RVC software, pitch extraction uses RMVPE at 48 kHz.

4.2) Dataset + preprocessing & deployment pipeline#

  • Data

    • Dataset: 10 minutes of single-speaker Chinese speech from a formal oral assessment recording.
    • Preprocessing: a built-in multi-stage pipeline in the RVC toolchain, including vocal cleanup/denoising and automatic slicing into short segments for training efficiency.
  • Pipeline

Mic input → real-time RVC inference (W-Okada) → virtual audio device routing → WeChat call audio input.
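Since the end-to-end figure is dominated by buffering, a back-of-envelope model helps reason about the audio path. The chunk size and buffer count below are illustrative assumptions, not measured values from this system:

```python
SAMPLE_RATE = 48_000  # Hz, the routing baseline used in this project

def buffer_latency_s(chunk_samples, n_buffers):
    """Seconds of latency contributed by n_buffers chunks of buffered audio."""
    return chunk_samples * n_buffers / SAMPLE_RATE

# e.g. three 4096-sample buffers across the capture/convert/output chain:
print(round(buffer_latency_s(4096, 3), 3))  # 0.256
```

With inference time and virtual-device routing overhead added on top, a buffering budget like this lands in the observed ~0.3–0.5 s range, which is why shrinking buffers is the main latency lever (at the cost of stability).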

4.3) Results (latency + stability)#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Sample rate | 48 kHz | Real-time routing baseline |
| f0 method | RMVPE | Pitch extraction for conversion |
| Voice latency (end-to-end) | ~0.3–0.5 s | Dominated by buffering + routing + inference |
| Stability | 12+ hours (call mode) | Most sensitive to audio device configuration |

4.4) Failure modes + mitigations#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Emotion / prosody mismatch | Timbre is convincing, but emotion/intonation feels "too formal" | Training data is mostly formal speech with limited expressive range | Expand training audio to include more emotional and conversational speech (future work); keep expectations realistic for current data |
| Routing instability (configuration sensitivity) | Occasional distortion or inconsistent input/output behavior | Virtual audio routing depends on a consistent device format and stable system settings | Standardize audio device settings (e.g., mono / 16-bit / 48 kHz) as a reliability baseline |

4.5) Takeaway#

The system’s perceived interaction latency is dominated by the voice conversion pipeline (~0.3–0.5 s end-to-end). In terms of quality, timbre similarity is achievable with clean data, but expressive realism (especially emotion when speaking) is primarily limited by the diversity of the training speech domain.

5) Module C — WeChat Persona Chatbot (GLM-4-9B + LoRA)#

This module implements a private persona chatbot for WeChat group chats. It fine-tunes GLM-4-9B-Chat with LoRA on prior WeChat group-chat history, then runs local inference on Windows and posts replies via GUI automation.

5.1) Goal + tooling choice#

  • Goal: generate persona-consistent replies in group chats; fine-tuning significantly reduced the default assistant-like tone, and the resulting style is much closer to the target’s voice.
  • Model: GLM-4-9B-Chat, fine-tuned with LoRA SFT and merged into a deployable checkpoint.
  • Constraint: chatbot runs in a separate mode due to GPU memory footprint (FP16 inference).

5.2) Dataset + formatting (high-level, anonymized)#

  • Training format: Alpaca-style JSONL with instruction / input / output.
  • Input: a fixed-size recent multi-speaker window (same structure as deployment).
  • Speaker isolation: nickname mapping → placeholders (e.g., <TARGET>, <U1>, <U2>) to reduce privacy leakage and “voice mixing.”
  • Scale: 34,677 cleaned training examples (from a subset of group chats).

Synthetic schema example for illustration:

```json
{
  "instruction": "You are <TARGET>. Reply in <TARGET>'s tone and style.",
  "input": "<U1>: ...\n<TARGET>: ...\n<U2>: ...",
  "output": "..."
}
```
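The speaker-isolation step can be sketched as a small formatting function that maps real nicknames to placeholders and emits an Alpaca-style sample of the shape above. `build_sample` and its details are illustrative, not the actual tooling:

```python
def build_sample(window, target_name):
    """window: list of (nickname, text) turns; the last turn is the target's reply."""
    mapping, next_id = {target_name: "<TARGET>"}, 1

    def anon(name):
        # Assign <U1>, <U2>, ... in order of first appearance.
        nonlocal next_id
        if name not in mapping:
            mapping[name] = f"<U{next_id}>"
            next_id += 1
        return mapping[name]

    *context, (last_name, last_text) = window
    assert last_name == target_name, "last turn must be the target's reply"
    return {
        "instruction": "You are <TARGET>. Reply in <TARGET>'s tone and style.",
        "input": "\n".join(f"{anon(n)}: {t}" for n, t in context),
        "output": last_text,
    }

sample = build_sample([("alice", "hi"), ("bob", "yo"), ("me", "hey")], "me")
assert sample["input"] == "<U1>: hi\n<U2>: yo"
assert sample["output"] == "hey"
```

Mapping nicknames per-sample (rather than globally) also limits how much identity information a single training example carries.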

5.3) Fine-tuning approach#

  • Supervised fine-tuning with LoRA on Debian Linux.
  • Deployment uses a merged checkpoint for simpler inference.
  • The main quality driver is dataset coverage and structure (speaker isolation + consistent context formatting), not just model size.
  • Each training sample supervises a single next reply given a fixed-size multi-speaker window, which is effective for learning tone but can be brittle for long-range consistency.

5.4) Deployment policy (WeChat integration)#

  • Scope: runs only in whitelisted group chat(s) (not global auto-reply).
  • Context window: replies are generated from the last 12 messages (multi-speaker).
  • Debounce + jitter: the bot waits briefly and introduces randomized delays to avoid replying mid-burst and to reduce spam-like behavior.
  • Cold-silence follow-up (capped):
    • If the chat has been quiet for 8 seconds and the last message was from the bot/self, it may send a follow-up.
    • Follow-ups are capped at 4 consecutive self-talk attempts to prevent runaway behavior.
  • Decoding defaults: top_p=0.85, temperature=1.15, max_new_tokens=128, repetition_penalty=1.1.
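The reply policy above can be sketched as two small helpers: a debounce delay with jitter, and the capped cold-silence follow-up check. The timing constants mirror the stated policy, while the function names and structure are hypothetical:

```python
import random

COLD_SILENCE_S = 8   # quiet period before a follow-up is allowed
MAX_FOLLOW_UPS = 4   # cap on consecutive self-talk attempts

def reply_delay_s(base=2.0, jitter=3.0):
    """Debounce: wait a base period plus random jitter before replying."""
    return base + random.uniform(0, jitter)

def may_follow_up(quiet_for_s, last_sender_is_self, follow_ups_sent):
    """Allow a follow-up only after cold silence, and only under the cap."""
    return (quiet_for_s >= COLD_SILENCE_S
            and last_sender_is_self
            and follow_ups_sent < MAX_FOLLOW_UPS)

assert may_follow_up(9.0, True, 0)
assert not may_follow_up(9.0, True, 4)   # cap reached
assert not may_follow_up(5.0, True, 0)   # not quiet long enough
```

The cap is the important part: without it, the cold-silence rule can re-trigger on the bot's own messages and spiral into self-talk.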

5.5) Results#

| Metric | Observed range | Notes |
| --- | --- | --- |
| Training examples | 34,677 | Cleaned Alpaca-style JSONL |
| Context strategy | Last 12 messages | Fixed-size recent window |
| Speaker isolation | Placeholders | Nickname mapping → <TARGET>, <U1>, … |
| Tone fidelity | Strong (subjective) | Style is consistent; limited drift observed |
| Chat-mode precision | FP16 | Dominates VRAM budget |
| Chat-mode GPU VRAM | ~19.2 GB | Runs separately from call mode |
| Stability | 12+ hours | Continuous runtime tested |
| Reply policy | Debounce + jitter + caps | Whitelisted groups only |

5.6) Failure modes + mitigations#

| Failure mode | What it looks like | Likely cause (high-level) | Mitigation (method-level) |
| --- | --- | --- | --- |
| Tone-right / content-wrong | Style matches, but meaning becomes incoherent (rare) | Limited coverage from a subset of chats; weak grounding | Expand dataset breadth; filter low-signal examples; add grounding later (future work: RAG) |
| Data-structure brittleness (context → single reply) | Good local replies, but shallow continuity across longer arcs | Supervision targets a single-turn next reply; long-range memory isn't represented | Improve data organization (multi-turn targets, longer targets, topic continuity); future: add a retrieval "memory system" |
| GUI automation fragility | Occasional missed or duplicated send | UI state is not a stable API | Add health checks; manual kill switch; periodic re-sync of chat state |

5.7) Takeaway#

For persona chatbots, the bottleneck in this project is data structure + speaker isolation + deployment policy, not just model size (although more parameters might help). The "voice" (tone/style) is already strong; the current ceiling is coverage and grounding. The most direct next steps are improving dataset breadth and organization, and adding retrieval grounding (RAG) as a dedicated memory system to keep replies coherent beyond the context window.

6) Evaluation (Fidelity + Latency + Reliability)#

Because this project is private, evaluation is intentionally artifact-free and subjective; it serves as a personal yardstick rather than a public benchmark. I evaluate the system along two axes:

  1. Systems metrics (FPS, latency, VRAM, stability), and
  2. Human realism rubric (visual / voice / text) assessed qualitatively during real use.

6.1) Systems metrics#

| Stream | Metric | Observed range | Notes |
| --- | --- | --- | --- |
| Visual (Call Mode) | Input resolution | 3840×2160 | Camera feed |
| Visual (Call Mode) | Internal processing | ~1080p | Downscaled for real-time |
| Visual (Call Mode) | FPS | ~25–30 FPS | Scene / occlusion / motion dependent |
| Voice (Call Mode) | End-to-end latency | ~0.3–0.5 s | Buffer/routing + inference dominated |
| Call Mode | GPU VRAM | ~8.9 GB | Face + voice together |
| Chat Mode | GPU VRAM | ~19.2 GB | FP16 local inference |
| Chat Mode | Context window | Last 12 messages | Fixed-size recent window |
| System | Stability | 12+ hours | Continuous runtime tested |

Key observation: perceived interaction latency is dominated by the voice conversion pipeline, not the video streaming.

6.2) Human realism rubric (qualitative)#

I score realism with a simple rubric that separates “sounds/looks right” from “behaves coherently over time”.

| Component | Dimension | Rating | What "good" means in practice |
| --- | --- | --- | --- |
| Visual | Stability | High | Minimal jitter; consistent alignment during normal motion |
| Visual | Texture realism | Medium | Skin-level details (pores/freckles) are the hardest to preserve |
| Visual | Robustness | Medium | Fails under occlusion, fast motion, or low light |
| Voice | Timbre similarity | High | Voice identity is convincing in typical speech |
| Voice | Prosody/emotion | Medium | Training speech domain is formal; emotional range is limited |
| Voice | Latency | Medium–Low | ~0.3–0.5 s is noticeable in conversation |
| Text | Tone/style fidelity | High | Fine-tuning reduces assistant-like tone; style is consistent |
| Text | Coherence | Medium | Usually coherent, with occasional "tone-right/content-wrong" cases |
| Text | Long-range consistency | Medium | Fixed-window supervision is weaker for long memory continuity |

(Ratings are subjective, not a formal user study.)

6.3) What worked vs. what bottlenecked#

What worked best

  • Tone/style transfer (text): LoRA fine-tuning effectively reduced the assistant-like tone and restored the target's original tone of speech.
  • Timbre conversion (voice): convincing identity under clean input and stable routing.
  • Real-time feasibility (visual): internal downscaling enabled stable FPS in live calls.

Primary bottlenecks

  • Visual realism ceiling: low light + real-time blending reduce skin-level texture fidelity.
  • Prosody mismatch: training audio is formal, limiting expressive speech.
  • Text grounding: multi-speaker context → single-reply supervision can drift semantically without a memory/grounding (RAG) layer.

6.4) Limitations (what this evaluation cannot claim)#

  • This is not a controlled user study, and there are no public artifacts to verify outputs.
  • Data coverage is limited (photo/audio/chat subset), so performance is uneven across edge cases.
  • The rubric prioritizes practical “call realism” over forensic realism, and results should be interpreted in that context.

7) Future Work and Roadmap#

This project is a working prototype, not a finished “product”. The next phase is about improving fidelity, robustness, and grounded consistency.

7.1) Roadmap (high-level)#

| Area | Upgrade | Expected impact | Effort | Risk / tradeoff |
| --- | --- | --- | --- | --- |
| Text | Broaden dataset coverage (more chats, more topics, more time range) | Fewer incoherent cases; better opinion/knowledge coverage | Medium–High | High (third-party privacy + cleaning complexity) |
| Text | Improve data organization beyond "context → single reply" | Better multi-turn continuity; less brittle long-arc behavior | Medium | Medium (formatting choices can add noise) |
| Text | RAG memory system (vector search over chat history) | Grounded responses; better long-range consistency beyond context window | Medium | Medium (must enforce strict retrieval filtering/redaction) |
| Voice | Expand training audio beyond formal speech | Better prosody/emotion; more natural casual speech (improves RVC quality) | Medium | Medium (data cleanliness + domain balance) |
| Voice | Retrain RVC for prosody + stability | Stronger "identity voice layer"; fewer artifacts when driven by TTS later | Medium–High | Medium (quality depends on data diversity) |
| Voice | Lower-latency audio path (reduce buffering where possible) | Faster turn-taking; improved "presence" | Medium | Medium (stability vs latency tradeoff) |
| System | Spoken agent mode: LLM(+RAG) → TTS → RVC → WeChat | Agent can speak autonomously while keeping target voice identity | Medium–High | High (text coherence gating + additional latency) |
| Visual | Avatar generation for calls (audio-driven talking head + lip-sync) | Phases out face swap; improves controllability in calls | High | High (temporal stability + latency constraints) |
| System | Full avatar-call mode: spoken agent + avatar renderer | Coherent "digital presence" across text/voice/video | High | High (end-to-end stability + governance) |
| Visual (V1 fallback) | DFM / DeepFaceLab training | Higher texture fidelity if face swap remains needed | High | High (requires thousands of high-quality images) |

7.2) What I would build next (concrete milestones)#

Milestone A — “Memory-first chatbot” (Text)

  • Export a broader message history.
  • Build a retrieval layer (RAG) that:
    • indexes only approved content,
    • retrieves with strict filtering (speaker/PII-aware),
    • feeds short, relevant snippets into the fixed-window context.
  • Re-evaluate coherence and long-range consistency.
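The retrieval layer's filtering step might look like the following minimal sketch: index only approved snippets, drop anything matching simple PII patterns, and rank by overlap with the current context. A real system would use embeddings; keyword overlap, the patterns, and all names here are illustrative stand-ins:

```python
import re

# Illustrative PII patterns; a real redaction pass would be far stricter.
PII_PATTERNS = [re.compile(r"\d{11}"),        # e.g. phone numbers
                re.compile(r"\S+@\S+\.\S+")]  # e.g. email addresses

def safe(snippet):
    """A snippet is indexable only if no PII pattern matches it."""
    return not any(p.search(snippet) for p in PII_PATTERNS)

def retrieve(query, approved_snippets, k=2):
    """Rank approved, PII-free snippets by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().split())), s)
              for s in approved_snippets if safe(s)]
    scored.sort(key=lambda x: -x[0])
    return [s for score, s in scored[:k] if score > 0]

docs = ["we talked about hiking last spring",
        "call me at 13800000000",          # filtered out: looks like PII
        "hiking in the rain was fun"]
print(retrieve("do you remember hiking", docs))
# ['we talked about hiking last spring', 'hiking in the rain was fun']
```

The key design point is that filtering happens at index time and retrieval time, so nothing unapproved or PII-bearing can reach the fixed-window context.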

Milestone B — “Identity Voice Layer” (RVC now, TTS later)

  • Collect more diverse speech domains (casual conversation, different emotions) to expand prosody coverage beyond formal speech.
  • Retrain RVC to improve prosody transfer and reduce artifacts (timbre is already strong; emotion is the ceiling).
  • Optimize the real-time path by experimenting with latency/stability tradeoffs (buffer size, routing configuration), while keeping call reliability acceptable.
  • Transition plan (gated): once the text agent is sufficiently coherent and grounded, replace the human mic input with TTS, and run:
    • LLM(+RAG) → TTS → RVC (identity voice) → virtual mic → WeChat call

  This keeps the identity voice while allowing the agent to speak autonomously.

Milestone C — “Higher-fidelity face” (Visual)

  • If enough high-quality images can be obtained, train a DFM model for better texture and lighting robustness.
  • Replace face-swap with generation-based streaming:
    • avatar/face generation (and lip-sync) driven by the spoken agent.
  • This phases out face swap and voice conversion as the default call-mode path.

7.3) Guiding principle#

Future upgrades should preserve the project’s core constraints: private by design, local-only, and publishing-safe. For this kind of system, the hardest problems are not just model capability—they’re data governance and responsible deployment.

8) Conclusion#

Project Still is an end-to-end digital-human prototype built across three streams—real-time face swap, real-time voice conversion, and a persona chatbot—integrated into WeChat for real interaction.

Over time, this pipeline will be iterated toward avatar generation, TTS feeding RVC, and the persona chatbot as the speaker.

From an engineering perspective, the biggest lesson is that “high fidelity” is less about any single model and more about the weakest link in the pipeline:

  • Visual: real-time performance requires downscaling, and realism is often limited by lighting, occlusion, and skin-texture fidelity.
  • Voice: timbre conversion can be convincing, but perceived responsiveness is dominated by buffering/routing latency, and expressive emotion and prosody depend heavily on training data diversity.
  • Text: fine-tuning is effective for tone, but coherence and long-range consistency are ultimately gated by coverage and grounding (not just model parameters).

Going forward, the roadmap is clear: expand data coverage, improve dataset organization beyond "context → single reply," and add a retrieval-backed memory system (RAG) to keep responses grounded beyond the context window. On the visual side, avatar generation can replace the current face-swap mechanism. On the audio side, more diverse audio datasets can raise the RVC model's prosody diversity, with TTS then providing the audio output.

Project Still is private and will remain so. This post shares only architecture, metrics, and lessons learned, as a baseline for me to iterate and improve. If it also helps other builders learn from the tradeoffs, I am grateful.

May this project bring some light to the world.

And I can always feel my friend is still with me.

Still, always.

https://www.t1sun.com/blog/project_still
Author: t1sun
Published: January 19, 2025