Cutting-edge science explained simply
This article explores why training Transformers is hard for SGD and why Adam is the more effective optimizer.
― 6 min read
Adam-mini reduces memory usage for training large language models while maintaining performance.
― 6 min read
MoFO helps large language models retain pre-trained knowledge during fine-tuning without sacrificing fine-tuning performance.
― 5 min read
Discover how efficient algorithms perform under strict time limits.
― 7 min read