Experience – Rui Zhao

Research Projects

2026 – now

Replaced the traditional exploration scheme, which injects Gaussian action noise through the denoising process, with a bidirectional encoder, giving fine-grained control over the action space.
Designed the algorithm to perturb actions only within recognized key chunks to improve exploration efficiency and reduce redundant rollouts.
Introduced a world model into the training loop to provide sufficient data and accurate guidance.

2025 – now

Aimed to remove noise and preserve task-relevant and reward-relevant information in the world model by making its output more compressed.
Designed a composite reward signal combining JPEG compressibility, action prediction consistency (via a custom ResNet inverse action model), and value preservation to guide fine-tuning of a diffusion world model.
Implemented PPO, DDPO, and AWM, adapting and unifying their codebases into a custom pipeline for diffusion world model fine-tuning, and conducted extensive experiments to iteratively refine the algorithms based on empirical results.
Focused on efficient and stable RL post-training of the world model to improve long-horizon accuracy and downstream policy performance.

2025

Used concepts from human cognitive psychology to design the reasoning process.
Constructed a cognitive graph/tree to represent the reasoning process and find better trajectories.
Used groups of samples with relative scores generated by DeepSeek R1 to train a smaller reward model.
Fine-tuned Qwen with RL to enhance general cognitive reasoning ability, rather than ability on specific tasks or datasets only, improving performance.

2024 – 2025

Proposed Tree-KG, an expandable framework for knowledge graph (KG) construction, comprising initial construction and iterative expansion.
Achieved state-of-the-art performance compared with existing methods such as GraphRAG, especially on textbooks.
Paper accepted at the ACL 2025 main conference; patent number 202610098472.8.

Noetix Robotics, Beijing | 2026 – now

Conducting research and development on embodied AI, including:

World Action Model Pretraining and Post-training.
VLA Post-training: Designing and reproducing memory systems and hierarchical inference mechanisms; implementing supervised fine-tuning and reinforcement learning to train robots to perform real-world long-horizon tasks.
Real-robot Deployment.

Tsinghua University | 2024 – 2025

Developed the core Meta Agent for the university’s AI education system.

Meta Agent Architecture: Based on OpenManus, designed a high-level agent capable of formulating reusable, generalizable workflows for diverse teaching tasks (e.g., grading, course design) from single instructions.
Automated Tool & Workflow Construction: Implemented a “Tool Maker” to synthesize Python scripts as new tools, and a “Workflow Maker” to organize operations into executable graphs, enabling the system to self-expand its capabilities.
Knowledge Extraction: Extracted knowledge from textbooks and other materials to build a knowledge graph for the course.

AI Cosmos is now used in the university’s daily academic activities, including course learning and research assistance.