"Invalid Argument (80070057)" crash with Transformer-based models (D-FINE/RT-DETR) due to Int64 indices in Gather/Scatter ops #727

@LeoYang06

Description

We are encountering a persistent E_INVALIDARG (80070057) crash when running D-FINE (RT-DETR based) ONNX models using DirectMLExecutionProvider on Windows. The same model runs perfectly on CpuExecutionProvider.

After extensive debugging and graph surgery, we identified the root cause as DirectML's incompatibility with Int64 indices in operators like Gather, ScatterND, TopK, and NonZero, which are prevalent in Transformer-based architectures exported from PyTorch.

Reproduction Steps

  1. Model: D-FINE (RT-DETR architecture) exported from PyTorch.

    • Contains dynamic shapes and Gather/Scatter ops with Int64 indices.
  2. Environment:

    • ONNX Runtime: 1.23.0
    • Provider: DirectML
    • OS: Windows 10/11
    • GPU: NVIDIA/Intel (the issue is vendor-agnostic)
  3. Code:

    using Microsoft.ML.OnnxRuntime;

    var options = new SessionOptions();
    options.AppendExecutionProvider_DML(0); // DirectML on device 0
    var session = new InferenceSession("dfine_model.onnx", options); // Crash here or during Run()

Observed Behavior

The initialization or execution fails with:

[E:onnxruntime:, inference_session.cc:2545] Exception during initialization: 
... \DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2853) ... 
Exception(1) tid(...) 80070057
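
For readers unfamiliar with the code: 80070057 is a Windows HRESULT. The minimal Python sketch below decodes its fields using the standard HRESULT bit layout (severity bit, 11-bit facility, 16-bit code). The helper name `decode_hresult` is hypothetical and exists only for illustration.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its standard fields."""
    hr &= 0xFFFFFFFF
    return {
        "failure": bool(hr >> 31),       # severity bit: 1 means failure
        "facility": (hr >> 16) & 0x7FF,  # 11-bit facility field
        "code": hr & 0xFFFF,             # 16-bit status code
    }

fields = decode_hresult(0x80070057)
print(fields)
```

Facility 7 is FACILITY_WIN32 and code 87 (0x57) is ERROR_INVALID_PARAMETER, which together form E_INVALIDARG. This matches the "parameter error" text in the attached log, but by itself it does not say which argument DirectML rejected.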

Analysis & Attempts

We have tried the following mitigation strategies, but none yielded a fully working graph due to conflicting constraints:

  1. Baseline (Float32 + Int64 Indices): Crashes with 80070057 on DirectML. (Works on CPU).
  2. Global Int64 -> Int32 Downcast:
    • We force-converted all Int64 tensors to Int32.
    • Result: Model loading fails ONNX validation with [ErrorCode:InvalidGraph].
    • Reason: The ONNX spec requires Int64 for the shape input of Reshape and Expand and for the sizes input of Resize (Resize's scales input is float). Downcasting those inputs makes the graph invalid.
  3. Selective Patching (The Deadlock):
    • We tried to cast only the inputs for Gather/Scatter to Int32, while keeping Reshape inputs as Int64.
    • Result: Type mismatch or graph topology errors.
    • Reason: PyTorch exports dynamic shape calculations as chains such as Shape -> Gather -> Concat -> Reshape, so the data flow requires the same tensor to be Int64 for Reshape but Int32 for Gather. Inserting Cast nodes to bridge the two breaks constant folding or shape inference in complex subgraphs.
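
The type constraint behind attempt 2 can be illustrated with a minimal Python sketch. It uses a toy graph representation rather than the onnx API, and it encodes only two constraints from the ONNX operator spec: Gather accepts Int32 or Int64 indices, while Reshape accepts only an Int64 shape input. The tensor names and the `check` helper are hypothetical; `node_view` is borrowed from the attached log.

```python
# Toy model, not the onnx API: each node is (name, op_type, input_names).
# Allowed dtypes for (op_type, input position), per the ONNX operator spec.
ALLOWED = {
    ("Gather", 1): {"int32", "int64"},  # Gather indices
    ("Reshape", 1): {"int64"},          # Reshape shape input
}

def check(nodes, dtypes):
    """Return (node_name, input_name, dtype) for every violated constraint."""
    errors = []
    for name, op, inputs in nodes:
        for pos, tensor in enumerate(inputs):
            allowed = ALLOWED.get((op, pos))
            if allowed is not None and dtypes[tensor] not in allowed:
                errors.append((name, tensor, dtypes[tensor]))
    return errors

# Shape -> Gather -> Concat -> Reshape, with hypothetical tensor names.
nodes = [
    ("node_gather", "Gather", ["shape_out", "idx"]),
    ("node_concat", "Concat", ["dim", "rest"]),
    ("node_view", "Reshape", ["x", "new_shape"]),
]
baseline = {"shape_out": "int64", "idx": "int64", "dim": "int64",
            "rest": "int64", "x": "float32", "new_shape": "int64"}
# Global downcast: every int64 tensor becomes int32.
downcast = {k: ("int32" if v == "int64" else v) for k, v in baseline.items()}

print(check(nodes, baseline))  # the original graph satisfies both constraints
print(check(nodes, downcast))  # Reshape's shape input is flagged
```

In this toy model the global downcast keeps Gather valid but invalidates Reshape, which is the same class of failure as the InvalidGraph error reported for `node_view`.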

Request

DirectML seems to lag behind the CPU implementation in Int64 support for indexing operators. Could the DirectML team:

  1. Native Support: Add native support for Int64 indices in Gather, Scatter, TopK, NonZero.
  2. Auto-Cast: Or implement an automatic graph optimization pass in DmlExecutionProvider to implicitly downcast Int64 indices to Int32 for supported operators, similar to how TensorRT handles mixed types.
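
To make the Auto-Cast request concrete, here is a minimal Python sketch of what such a pass might do, again on a toy graph representation rather than on real ONNX Runtime data structures. Everything in it is an assumption for illustration: the function names, the `_i32` suffix, and the choice to rewrite only the indices input of Gather. It is not how DmlExecutionProvider works today, and it does not show that DirectML would accept the rewritten graph.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def fits_int32(values):
    """True if every constant index is representable as int32."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)

def autocast_gather_indices(nodes, dtypes, constants):
    """Insert a Cast in front of int64 Gather indices (hypothetical pass).

    Only the indices input (position 1) is rewired. The data input and the
    Gather output are left alone, so downstream consumers see the same types.
    Returns a new node list and a new dtype map; the inputs are not mutated.
    """
    dtypes = dict(dtypes)
    out = []
    for name, op, inputs in nodes:
        if op == "Gather" and dtypes[inputs[1]] == "int64":
            idx = inputs[1]
            # Leave the node untouched if known constant values would overflow.
            if idx in constants and not fits_int32(constants[idx]):
                out.append((name, op, inputs))
                continue
            cast_out = idx + "_i32"
            dtypes[cast_out] = "int32"
            out.append((name + "_cast", "Cast", [idx]))
            inputs = [inputs[0], cast_out] + list(inputs[2:])
        out.append((name, op, inputs))
    return out, dtypes

nodes = [("node_gather", "Gather", ["shape_out", "idx"])]
dtypes = {"shape_out": "int64", "idx": "int64"}
new_nodes, new_dtypes = autocast_gather_indices(nodes, dtypes, {"idx": [0]})
print(new_nodes)
```

The range check only covers indices that are known constants. Indices computed at run time cannot be validated statically, so a real pass would need a policy for them.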

This is a major blocker for deploying modern Transformer-based Vision models (DETR family) on Windows via DirectML.

Attachments:

  • Visual Studio Debug Log:
Unhandled exception. Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:Fail] Load model from E:\OpenSource\Github\YoloDotNet_Self\test\assets\Models\dfine_obj365_directml_nuclear.onnx failed:Type Error: Type (tensor(int64)) of output arg (unsqueeze) of node (node_unsqueeze) does not match expected type (tensor(int32)).
at Microsoft.ML.OnnxRuntime.InferenceSession.Init(String modelPath, SessionOptions options, PrePackedWeightsContainer prepackedWeightsContainer)
at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath, SessionOptions options)
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider.InitializeYolo(Object model, Int32 gpuId) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 86
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider..ctor(String model, Int32 gpuId, OnnxMetadataOverride metadataOverride) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 43
[E:onnxruntime:, inference_session.cc:2545 onnxruntime::InferenceSession::Initialize::<lambda_73d8de3ce9bc7d47058d99ebffb3c8e5>::operator ()] Exception during initialization: E:_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2853)\onnxruntime.DLL!00007FFBADFEDC2C: (caller: 00007FFBADFFD699) Exception(1) tid(9eec) 80070057
Unhandled exception. Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:RuntimeException] Exception during initialization: E:_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2853)\onnxruntime.DLL!00007FFBADFEDC2C: (caller: 00007FFBADFFD699) Exception(1) tid(9eec) 80070057 parameter error.
at Microsoft.ML.OnnxRuntime.InferenceSession.Init(String modelPath, SessionOptions options, PrePackedWeightsContainer prepackedWeightsContainer)
at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath, SessionOptions options)
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider.InitializeYolo(Object model, Int32 gpuId) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 86
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider..ctor(String model, Int32 gpuId, OnnxMetadataOverride metadataOverride) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 43
Unhandled exception. Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:InvalidGraph] Load model from E:\OpenSource\Github\YoloDotNet_Self\test\assets\Models\dfine_obj365_dml_global.onnx failed:This is an invalid model. Type Error: Type 'tensor(int32)' of input parameter (val_577) of operator (Reshape) in node (node_view) is invalid.
at Microsoft.ML.OnnxRuntime.InferenceSession.Init(String modelPath, SessionOptions options, PrePackedWeightsContainer prepackedWeightsContainer)
at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath, SessionOptions options)
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider.InitializeYolo(Object model, Int32 gpuId) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 86
at YoloDotNet.ExecutionProvider.DirectML.DirectMLExecutionProvider..ctor(String model, Int32 gpuId, OnnxMetadataOverride metadataOverride) in E:\OpenSource\Github\YoloDotNet_Self\YoloDotNet.ExecutionProvider.DirectML\DirectMLExecutionProvider.cs:line 43
