Start on GPU extensions (again) #320
Conversation
Force-pushed 2c9b2d3 to 45ba5dc
```julia
function ROCDiagonalTensorMap{T}(::UndefInitializer, V::TensorMapSpace) where {T}
    (numin(V) == numout(V) == 1 && domain(V) == codomain(V)) ||
        throw(ArgumentError("DiagonalTensorMap requires a space with equal domain and codomain and 2 indices"))
    return ROCDiagonalTensorMap{T}(undef, domain(V))
end
function ROCDiagonalTensorMap{T}(::UndefInitializer, V::ProductSpace) where {T}
    length(V) == 1 ||
        throw(ArgumentError("DiagonalTensorMap requires `numin(d) == numout(d) == 1`"))
    return ROCDiagonalTensorMap{T}(undef, only(V))
end
```
We don't have these generalizations of the constructor for the normal (i.e. non-GPU) DiagonalTensorMap, right? Is this useful?
```diff
- maxS = maximum(first, values(S))
- minS = minimum(last, values(S))
+ maxS = mapreduce(maximum, max, values(S))
+ minS = mapreduce(minimum, min, values(S))
```
Do `first` and `last` not work for the GPU arrays?
Unfortunately, no, as they require scalar indexing.
Would it make sense to use `@allowscalar` here instead? I don't know anything about the performance characteristics here, but it seems hard to imagine that computing the max/min is faster than just taking the first/last element.
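For reference, the `@allowscalar` variant being suggested might look roughly like this (a minimal sketch; it assumes `S` is the same dict of sorted singular-value blocks as in the diff above):

```julia
using GPUArraysCore: @allowscalar  # this is where CUDA.jl and AMDGPU.jl get @allowscalar from

# Sketch: read only the first/last entry of each (sorted) singular-value block,
# at the cost of a couple of scalar GPU reads per block.
maxS = @allowscalar maximum(first, values(S))
minS = @allowscalar minimum(last, values(S))
```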
Unsure, we could test it. Generally, usage of `@allowscalar` is strongly discouraged.
A possibly better solution would be to rewrite this entire for-loop to use broadcasting
If we don't mind touching the internals of the DiagonalTensorMap, we could also just take the minimum and maximum of the entire vector in one call.
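A minimal sketch of that idea, assuming the singular values come from a `DiagonalTensorMap` (called `D` here) that stores all of its diagonal entries in one underlying vector; the `data` field name is an assumption about the internals:

```julia
# Sketch: one GPU reduction over the full diagonal vector instead of one per block.
# `D` and its `data` field are assumptions about where the singular values live.
maxS = maximum(D.data)
minS = minimum(D.data)
```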
```julia
function TensorKit.scalar(t::ROCTensorMap)
    # TODO: should scalar only work if N₁ == N₂ == 0?
    return @allowscalar dim(codomain(t)) == dim(domain(t)) == 1 ?
        first(blocks(t))[2][1, 1] : throw(DimensionMismatch())
end
```
This function is being somewhat rewritten in #291, so it might need to be revised if that PR is merged.
Ok I finished going through the AMD extension; I guess several of the comments will also apply to the CUDA extension so I'll wait a bit before reviewing that as well.

The AMD code is mostly preliminary at this point as

Thanks for the reviews so far! I've incorporated the feedback and am running the CUDA tests right now.
Force-pushed 95f91a7 to 8bbc3ff
lkdvos left a comment
I went through the constructors for now, mostly with an eye toward minimizing the amount of code duplication. Definitely feel free to discard some of these comments if they don't make too much sense, but I think there are some viable candidate ideas in there that would improve the overall robustness of the TensorMap constructors/converters to begin with, and would simultaneously reduce the amount of boilerplate in the CUDA and AMD extensions.
```julia
function TensorKit.tensormaptype(S::Type{<:IndexSpace}, N₁, N₂, TorA::Type{<:StridedROCArray})
    if TorA <: ROCArray
        return TensorMap{eltype(TorA), S, N₁, N₂, ROCVector{eltype(TorA), AMDGPU.Mem.HIPBuffer}}
    else
        throw(ArgumentError("argument $TorA should specify a scalar type (`<:Number`) or a storage type `<:ROCVector{<:Number}`"))
    end
end
```
Could you maybe explain where this definition comes from? I would have guessed that if we ask for the tensormaptype with a given A, this function really should return the TensorMap{..., A}, which would be the default implementation but is overwritten here. I'm guessing I'm missing something, but it might be good to add a comment in that case to explain why this is necessary.
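For illustration, the "default" behaviour alluded to here would presumably look something like this (a sketch only; the exact method signature is an assumption):

```julia
# Sketch of the expected default: when a storage type A is passed explicitly,
# simply thread it through as the storage parameter of the TensorMap type.
function TensorKit.tensormaptype(S::Type{<:IndexSpace}, N₁, N₂, A::Type{<:ROCVector{<:Number}})
    return TensorMap{eltype(A), S, N₁, N₂, A}
end
```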
```julia
const ROCTensorMap{T, S, N₁, N₂} = TensorMap{T, S, N₁, N₂, ROCVector{T, AMDGPU.Mem.HIPBuffer}}
```
I don't know much about the AMD ecosystem, but would there ever be a point in being less restrictive about this and defining it as follows, possibly with a const to determine what the allowed array types are?
```diff
- const ROCTensorMap{T, S, N₁, N₂} = TensorMap{T, S, N₁, N₂, ROCVector{T, AMDGPU.Mem.HIPBuffer}}
+ const ROCTensorMap{T, S, N₁, N₂, A <: DenseRocVector{T}} = TensorMap{T, S, N₁, N₂, A}
```
```julia
function ROCTensorMap{T}(::UndefInitializer, V::TensorMapSpace{S, N₁, N₂}) where {T, S, N₁, N₂}
    return ROCTensorMap{T, S, N₁, N₂}(undef, V)
end
```
Thinking out loud here:
Would it also work if we rewrite the undef constructors logic in tensors.jl in a pattern like the following, using an additional intermediate step that selects the storage type, rather than doing that all in one go?
```julia
const _TensorMap{T, A <: DenseVector{T}, S, N1, N2} = TensorMap{T, S, N1, N2, A}
const DefaultTensorData{T} = Vector{T}

TensorMap{T}(::UndefInitializer, V::TensorMapSpace) where {T} =
    _TensorMap{T, DefaultTensorData{T}}(undef, V)
_TensorMap{T, A}(::UndefInitializer, V::TensorMapSpace) where {T, A} =
    TensorMap{T, spacetype(V), numout(V), numin(V), A}(undef, V)
```

and similar constructions for the other ones? This might avoid having to duplicate these constructors for AMD and CUDA, and could also pave the way for experiments with using `TensorMap{T, S, N1, N2, Memory{T}}`, which I'm guessing has very similar kinds of demands (and could be used to test the constructors etc. without having to rely on Buildkite?)
```julia
Alternatively, the domain and codomain can be specified by passing a [`HomSpace`](@ref)
using the syntax `codomain ← domain` or `domain → codomain`.
"""
function ROCTensorMap(
```
Double comment here:

- I think we probably also want `TensorMap(::SectorDict{I,<:ROCMatrix}, ...)` to return a `ROCTensorMap`, which I don't think is currently happening.
- We might also want `ROCTensorMap(::SectorDict{I,<:AbstractMatrix})` to automatically promote to `ROCVector`?

Given this, and my previous comment, would it make sense to alter this constructor to first attempt to figure out a type `A`, and then call `TensorMap{T,...,A}(data)`, which then enforces the data type?
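A sketch of what that two-step pattern could look like (entirely hypothetical: `storagetype_from_matrix` is a made-up placeholder for whatever mechanism ends up selecting the storage type `A`):

```julia
# Hypothetical two-step pattern: first determine a storage type A from the data,
# then delegate to the fully parameterized constructor, which enforces that type.
function TensorMap(data::AbstractDict{<:Sector, <:AbstractMatrix}, V::TensorMapSpace{S, N₁, N₂}) where {S, N₁, N₂}
    M = typeof(first(values(data)))   # simplistic; a real version should promote across all blocks
    A = storagetype_from_matrix(M)    # placeholder hook: matrix type -> DenseVector storage type
    return TensorMap{eltype(A), S, N₁, N₂, A}(data, V)
end
```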
```julia
using the syntax `codomain ← domain` or `domain → codomain`.
"""
function ROCTensorMap(
    data::AbstractDict{<:Sector, <:ROCArray},
```
```diff
- data::AbstractDict{<:Sector, <:ROCArray},
+ data::AbstractDict{<:Sector, <:ROCMatrix},
```
I agree, but I think we want to alter the TensorMap(data::AbstractDict{<:Sector, <:AbstractMatrix}, ...) for that, in such a way that we indeed first determine a suitable A<:DenseMatrix associated with the types in the values(data) list.
with the caveat that A <: DenseVector 😉
Potato potato
True, although it can be slightly tricky to find a generic implementation of a function that takes in `M <: AbstractMatrix` and spits out an `A <: AbstractVector` that corresponds to it, which is more or less why I brought it up.
Absolutely, I have been struggling with that before. This was just my funny Friday evening comment.
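One possible, admittedly fragile, sketch of such a mapping, relying on inference of `similar` (the helper name is made up, not an existing TensorKit function):

```julia
# Hypothetical helper: given a matrix type M, infer the corresponding 1-D storage type
# by asking what `similar` would return for a 1-D allocation of the same element type.
function vectorstoragetype(::Type{M}) where {M <: AbstractMatrix}
    T = eltype(M)
    return Core.Compiler.return_type(similar, Tuple{M, Type{T}, Tuple{Int}})
end

# vectorstoragetype(Matrix{Float64}) === Vector{Float64}; the hope would be that
# vectorstoragetype(ROCMatrix{Float64}) likewise yields the matching ROCVector type.
```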
```julia
        return copy!(tnew, t)
    end
end
```
Is this not covered by the generic implementation in TensorKit.jl/src/tensors/tensor.jl, lines 525 to 534 at a93edc6?

```julia
function Base.convert(
        TT::Type{TensorMap{T, S, N₁, N₂, A}}, t::AbstractTensorMap{<:Any, S, N₁, N₂}
    ) where {T, S, N₁, N₂, A}
    if typeof(t) === TT
        return t
    else
        tnew = TT(undef, space(t))
        return copy!(tnew, t)
    end
end
```
```julia
function Base.copy!(tdst::ROCTensorMap{T, S, N₁, N₂}, tsrc::ROCTensorMap{T, S, N₁, N₂}) where {T, S, N₁, N₂}
    space(tdst) == space(tsrc) || throw(SpaceMismatch("$(space(tdst)) ≠ $(space(tsrc))"))
    for ((c, bdst), (_, bsrc)) in zip(blocks(tdst), blocks(tsrc))
        copy!(bdst, bsrc)
    end
    return tdst
```
I'm guessing this is here to avoid going through the StridedView path which doesn't work on GPU? It might be reasonable to just remove the StridedView method, since this historically was there before everything was in a single large vector in TensorMap, and I would say this is a bit too opinionated for a generic AbstractTensorMap implementation.
Additionally, does this not coincide with the Base.copy!(::TensorMap, ::TensorMap) definition?
```julia
    return tdst
end

function Base.copy!(tdst::ROCTensorMap, tsrc::TensorKit.AdjointTensorMap)
```
Same comment about possibly just having this implementation as the `copy!(::AbstractTensorMap, ::AbstractTensorMap)` implementation.
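For concreteness, the generic method being suggested would presumably be the same block-wise loop with widened signatures, roughly:

```julia
# Sketch of the suggested generic fallback: block-wise copy! between any two tensor
# maps over the same space, without assuming anything about the underlying storage.
function Base.copy!(tdst::AbstractTensorMap, tsrc::AbstractTensorMap)
    space(tdst) == space(tsrc) || throw(SpaceMismatch("$(space(tdst)) ≠ $(space(tsrc))"))
    for ((c, bdst), (_, bsrc)) in zip(blocks(tdst), blocks(tsrc))
        copy!(bdst, bsrc)
    end
    return tdst
end
```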
```julia
# Conversion to ROCArray:
#----------------------
# probably not optimized for speed, only for checking purposes
function Base.convert(::Type{ROCArray}, t::AbstractTensorMap)
```
Can we rewrite the base version to take Type{A} where {A <: Array}, rather than hardcoding Array there?
Can try!
```julia
function TensorKit.Factorizations.MAK.initialize_output(::typeof(svd_vals!), t::ROCTensorMap, alg::AbstractAlgorithm)
    V_cod = infimum(fuse(codomain(t)), fuse(domain(t)))
    return ROCDiagonalTensorMap{real(scalartype(t))}(undef, V_cod)
```
We could consider introducing a `similar_diagonal(t, T)` function/hook in the base implementations, such that we could overload just that function and avoid having to copy the boilerplate here.
[edit] I hadn't seen that you already more or less did this for the base implementation, in which case these functions might not be necessary anymore?
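For reference, the kind of hook that was being proposed might have looked roughly like this (illustrative only; the extra space argument `V` is an assumption, since the output space is computed inside `initialize_output`):

```julia
# Illustrative sketch of the proposed hook: a single overload point for "allocate a
# diagonal tensor like t", so GPU extensions specialize this instead of copying boilerplate.
similar_diagonal(t::AbstractTensorMap, ::Type{T}, V) where {T} = DiagonalTensorMap{T}(undef, V)

# The AMD extension would then only need:
# similar_diagonal(t::ROCTensorMap, ::Type{T}, V) where {T} = ROCDiagonalTensorMap{T}(undef, V)
```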
#299 will be closed; it was easier to just create a new branch and cherry-pick.