InfiniTensor · chengtyao2-design · Jan 14, 2026
diff --git a/python/infinilm/models/qwen3moe/MOE_REFACTOR.md b/python/infinilm/models/qwen3moe/MOE_REFACTOR.md
@@ -0,0 +1,23 @@
+# Qwen3 MoE refactor (infinicore)
+
+This refactor introduces an infinicore version of the `Qwen3MoeSparseMoeBlock` at `python/infinilm/models/qwen3moe/qwen3moe.py`. The block now subclasses `infinicore.nn.Module`, stores weights as `infinicore.nn.Parameter`, and exposes the same class names as the Torch reference for drop-in usage.
+
+## What changed
+- Router uses `infinicore.nn.functional.linear` for the projection and a Python softmax+top-k shim to select experts, returning both scores and indices.
+- Experts run gate/up/down projections with the stored weights but rely on NumPy to emulate routing utilities (one-hot masks, scatter add) that are not present in infinicore yet.
+- The sparse MoE block returns `(hidden_states, routing_weights)`, reshaped to `(batch, seq, hidden_dim)` and `(batch, seq, top_k)` respectively, so callers can inspect the gate decisions.
+
+## Missing operators and temporary shims
+The following Torch ops are not available in infinicore today and are emulated in pure Python/NumPy inside `qwen3moe.py`:
+- Softmax over the expert dimension.
+- `topk` selection of experts.
+- One-hot expansion of expert indices.
+- Scatter/add (`index_add_`) to accumulate expert outputs.
+- Boolean masking utilities (`where`/`nonzero`) used for routing.
+
+All shims use `_tensor_to_numpy` to bridge an infinicore tensor to NumPy and `_from_numpy_like` to move results back while keeping device/dtype. Replace these with native infinicore kernels once they land to regain performance.
+
+## Notes and next steps
+- Activation currently supports `silu`/`swish`, `gelu`, and `relu`. Extend `_activation_fn` if the config uses other functions.
+- Weight initialization mirrors the Torch reference (`empty` for expert matrices, `zeros` for router weights); hook up a proper initializer if required.
+- When infinicore adds native softmax/top-k/one-hot/scatter, the Python shims can be deleted and the routing path can stay entirely on-device.
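
The router shims listed in MOE_REFACTOR.md (softmax over the expert dimension, then top-k selection) can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not the code in `qwen3moe.py`: the names `softmax_np`, `topk_np`, and `route` are hypothetical, and the inputs are ordinary NumPy arrays rather than infinicore tensors.

```python
import numpy as np


def softmax_np(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis` (hypothetical helper)."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def topk_np(scores: np.ndarray, k: int):
    """Return (values, indices) of the k largest entries on the last axis,
    sorted in descending order (hypothetical helper)."""
    # argpartition moves the k largest entries into the last k slots, unordered.
    part = np.argpartition(scores, -k, axis=-1)[..., -k:]
    part_vals = np.take_along_axis(scores, part, axis=-1)
    # Sort only those k candidates, largest first.
    order = np.argsort(-part_vals, axis=-1)
    indices = np.take_along_axis(part, order, axis=-1)
    values = np.take_along_axis(scores, indices, axis=-1)
    return values, indices


def route(router_logits: np.ndarray, top_k: int, normalize: bool = True):
    """Softmax over experts, keep the top_k, optionally renormalize to sum to 1."""
    probs = softmax_np(router_logits, axis=-1)
    weights, experts = topk_np(probs, top_k)
    if normalize:
        weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, experts
```

With `router_logits` of shape `(tokens, num_experts)`, `route` returns two `(tokens, top_k)` arrays, which matches the routing-weight shape described above once reshaped to `(batch, seq, top_k)`.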
diff --git a/python/infinilm/models/qwen3moe/MOE_REFACTOR_CN.md b/python/infinilm/models/qwen3moe/MOE_REFACTOR_CN.md
@@ -0,0 +1,33 @@
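+# Qwen3 MoE refactor notes (infinicore version)
+
+This document records how `Qwen3MoeSparseMoeBlock` was migrated from torch to the infinicore framework, lists the missing operators, and describes the temporary Python/NumPy implementations that stand in for them.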
+
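+## Refactoring approach
+- **Consistent interface**: `python/infinilm/models/qwen3moe/qwen3moe.py` implements `Qwen3MoeExperts`, `Qwen3MoeTopKRouter`, and `Qwen3MoeSparseMoeBlock`; class names and call conventions match the torch version so the classes can be swapped in.
+- **Parameter type migration**: expert weights and router weights are stored as `infinicore.nn.Parameter` and created via `infinicore.empty/zeros`, so device and dtype stay configurable.
+- **Prefer infinicore operators**: linear layers call `infinicore.nn.functional.linear`; the remaining routing operators that infinicore lacks are temporarily handled in Python/NumPy.
+- **Shape-preserving return values**: the MoE block outputs the mixed result with shape `(batch, seq, hidden_dim)` and the routing scores with shape `(batch, seq, top_k)`, which makes it easy to match the original behavior.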
+
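+## Missing operators and temporary implementations
+infinicore currently lacks the following common torch operators; all are emulated in pure Python/NumPy inside `qwen3moe.py`:
+
+| Function | torch equivalent | Status | Temporary workaround |
+| --- | --- | --- | --- |
+| Softmax | `torch.softmax` | Missing | `_softmax_np`: convert to NumPy and compute softmax over the last dimension |
+| Top-K | `torch.topk` | Missing | `_topk_np`: `argpartition` finds the top k, which are then sorted |
+| One-Hot | `torch.nn.functional.one_hot` | Missing | `_one_hot_np`: built with `np.eye` |
+| Scatter/Add | `index_add_` | Missing | `np.add.at` accumulates along the token dimension |
+| Mask/filter | `where/nonzero` | Partially missing | A combination of `np.nonzero`/`np.where` |
+
+The helper functions `_tensor_to_numpy` and `_from_numpy_like` bridge infinicore tensors and NumPy while keeping dtype/device consistent; they can be removed if the underlying layer gains a direct conversion interface.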
+
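+## Key modules
+- **Qwen3MoeTopKRouter**: applies one linear projection to the input (infinicore), then uses NumPy softmax + top-k to obtain routing scores and expert indices, with optional normalization.
+- **Qwen3MoeExperts**: for the tokens routed to each expert, applies the gate/up projections, the activation (`silu/swish`, `gelu`, and `relu` are supported), and the down projection, then accumulates along the token dimension with `np.add.at`.
+- **Qwen3MoeSparseMoeBlock**: flattens the batch/seq dimensions and feeds them to the router; after obtaining `routing_weights` and `selected_experts`, it calls the experts to aggregate, then reshapes back to the original shape and also returns the routing scores.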
+
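+## Known limitations and follow-up optimizations
+- The routing path uses NumPy, which for now incurs CPU round trips and a performance penalty; once infinicore provides softmax/top-k/one-hot/scatter operators, these Python branches can be removed entirely.
+- Activation functions currently cover `silu/swish`, `gelu`, and `relu`; if the config uses another activation, extend `_activation_fn`.
+- Weight initialization still uses `empty/zeros`; if strict alignment with the original model is needed, add initialization logic at load or construction time.
+- Unit tests should be added later to compare output shapes and values against the torch reference implementation (when available) to ensure compatibility.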
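
The expert path described in MOE_REFACTOR_CN.md (one-hot mask via `np.eye`, token selection via `np.nonzero`, accumulation via `np.add.at`) can be sketched as below. This is an illustrative sketch, not the implementation in `qwen3moe.py`: the function names, the separate `gate_w`/`up_w`/`down_w` arrays, and their `(num_experts, out_dim, in_dim)` layout are all assumptions, and only the `silu` activation is shown.

```python
import numpy as np


def one_hot_np(indices: np.ndarray, num_classes: int) -> np.ndarray:
    """One-hot expand integer indices by indexing rows of an identity matrix."""
    return np.eye(num_classes, dtype=np.int64)[indices]


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def experts_forward(hidden, weights, experts, gate_w, up_w, down_w):
    """Assumed shapes (hypothetical layout):
    hidden: (tokens, hidden_dim); weights, experts: (tokens, top_k);
    gate_w, up_w: (num_experts, inter_dim, hidden_dim);
    down_w: (num_experts, hidden_dim, inter_dim).
    """
    num_experts = gate_w.shape[0]
    out = np.zeros_like(hidden)
    mask = one_hot_np(experts, num_experts)  # (tokens, top_k, num_experts)
    for e in range(num_experts):
        # Tokens (and their top-k slot) that were routed to expert e.
        token_idx, slot_idx = np.nonzero(mask[:, :, e])
        if token_idx.size == 0:
            continue
        x = hidden[token_idx]
        h = silu(x @ gate_w[e].T) * (x @ up_w[e].T)
        y = (h @ down_w[e].T) * weights[token_idx, slot_idx][:, None]
        # Unbuffered add: repeated token indices accumulate instead of overwriting.
        np.add.at(out, token_idx, y)
    return out
```

`np.add.at` is used rather than `out[token_idx] += y` because fancy-indexed in-place addition silently drops contributions when an index repeats; that property is what makes it a reasonable stand-in for `index_add_`.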
diff --git a/python/infinilm/models/qwen3moe/__init__.py b/python/infinilm/models/qwen3moe/__init__.py
@@ -0,0 +1,35 @@
+import os
+from typing import Optional, Union
+import infinicore
+
+__all__ = ["AutoQwen3MOEModel"]
+
+
+class AutoQwen3MOEModel:
+    @classmethod
+    def from_pretrained(
+        cls,
+        model_path: Optional[Union[str, os.PathLike]],
+        device: infinicore.device,
+        dtype: infinicore.dtype,
+        backend="python",
+    ):
+        if backend == "python":
+            from . import modeling_qwen3moe
+
+            return modeling_qwen3moe.Qwen3MOE.from_pretrained(
+                model_path,
+                device=device,
+                dtype=dtype,
+            )
+
+        # elif backend == "cpp":
+        #     from .backends import cpp
+
+        #     return cpp.LlamaForCausalLM.from_pretrained(
+        #         model_path,
+        #         device=device,
+        #         dtype=dtype,
+        #     )
+
+        raise KeyError(f"invalid backend: {backend!r}")