Start on GPU extensions (again) #320
Conversation
Force-pushed 2c9b2d3 to 45ba5dc
```julia
function ROCDiagonalTensorMap{T}(::UndefInitializer, V::TensorMapSpace) where {T}
    (numin(V) == numout(V) == 1 && domain(V) == codomain(V)) ||
        throw(ArgumentError("DiagonalTensorMap requires a space with equal domain and codomain and 2 indices"))
    return ROCDiagonalTensorMap{T}(undef, domain(V))
end
function ROCDiagonalTensorMap{T}(::UndefInitializer, V::ProductSpace) where {T}
    length(V) == 1 ||
        throw(ArgumentError("DiagonalTensorMap requires `numin(d) == numout(d) == 1`"))
    return ROCDiagonalTensorMap{T}(undef, only(V))
end
```
We don't have these generalizations of the constructor for the normal (i.e. non-GPU) DiagonalTensorMap, right? Is this useful?
```diff
- maxS = maximum(first, values(S))
- minS = minimum(last, values(S))
+ maxS = mapreduce(maximum, max, values(S))
+ minS = mapreduce(minimum, min, values(S))
```
Do `first` and `last` not work for the GPU arrays?
Unfortunately, no, as they require scalar indexing.
Would it make sense to use `@allowscalar` here instead? I don't know anything about the performance characteristics here, but it seems hard to imagine that computing the max/min is faster than just taking the first/last element.
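For reference, the `@allowscalar` variant being suggested might look roughly like this (a minimal sketch; it assumes `S` is the same dict of sorted singular-value blocks as in the diff above):

```julia
using GPUArraysCore: @allowscalar  # this is where CUDA.jl and AMDGPU.jl get @allowscalar from

# Sketch: read only the first/last entry of each (sorted) singular-value block,
# at the cost of a couple of scalar GPU reads per block.
maxS = @allowscalar maximum(first, values(S))
minS = @allowscalar minimum(last, values(S))
```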
Unsure, we could test it. Generally, usage of `@allowscalar` is strongly discouraged.
A possibly better solution would be to rewrite this entire for-loop to use broadcasting
If we don't mind touching the internals of the DiagonalTensorMap, we could also just take the minimum and maximum of the entire vector in one call.
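A minimal sketch of that idea, assuming the singular values come from a `DiagonalTensorMap` (called `D` here) that stores all of its diagonal entries in one underlying vector; the `data` field name is an assumption about the internals:

```julia
# Sketch: one GPU reduction over the full diagonal vector instead of one per block.
# `D` and its `data` field are assumptions about where the singular values live.
maxS = maximum(D.data)
minS = minimum(D.data)
```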
```julia
function TensorKit.scalar(t::ROCTensorMap)
    # TODO: should scalar only work if N₁ == N₂ == 0?
    return @allowscalar dim(codomain(t)) == dim(domain(t)) == 1 ?
        first(blocks(t))[2][1, 1] : throw(DimensionMismatch())
end
```
This function is being somewhat rewritten in #291, so it might need to be revised if that PR is merged.
Ok I finished going through the AMD extension; I guess several of the comments will also apply to the CUDA extension so I'll wait a bit before reviewing that as well.

The AMD code is mostly preliminary at this point as

Thanks for the reviews so far! I've incorporated the feedback and am running the CUDA tests right now.
Force-pushed 95f91a7 to 8bbc3ff
lkdvos left a comment
I went through the constructors for now, mostly with an eye toward minimizing the amount of code duplication. Definitely feel free to discard some of these comments if they don't make too much sense, but I think there are some viable candidate ideas in there that would improve the overall robustness of the TensorMap constructors/converters to begin with, and would simultaneously reduce the amount of boilerplate in the CUDA and AMD extensions.
```julia
function TensorKit.tensormaptype(S::Type{<:IndexSpace}, N₁, N₂, TorA::Type{<:StridedROCArray})
    if TorA <: ROCArray
        return TensorMap{eltype(TorA), S, N₁, N₂, ROCVector{eltype(TorA), AMDGPU.Mem.HIPBuffer}}
    else
        throw(ArgumentError("argument $TorA should specify a scalar type (`<:Number`) or a storage type `<:ROCVector{<:Number}`"))
    end
end
```
Could you maybe explain where this definition comes from? I would have guessed that if we ask for the tensormaptype with a given A, this function really should return the TensorMap{..., A}, which would be the default implementation but is overwritten here. I'm guessing I'm missing something, but it might be good to add a comment in that case to explain why this is necessary.
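For illustration, the "default" behaviour alluded to here would presumably look something like this (a sketch only; the exact method signature is an assumption):

```julia
# Sketch of the expected default: when a storage type A is passed explicitly,
# simply thread it through as the storage parameter of the TensorMap type.
function TensorKit.tensormaptype(S::Type{<:IndexSpace}, N₁, N₂, A::Type{<:ROCVector{<:Number}})
    return TensorMap{eltype(A), S, N₁, N₂, A}
end
```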
```julia
const ROCTensorMap{T, S, N₁, N₂} = TensorMap{T, S, N₁, N₂, ROCVector{T, AMDGPU.Mem.HIPBuffer}}
```
I don't know much about the AMD ecosystem, but would there ever be a point in being less restrictive about this and defining it as follows, possibly with a const to determine what the allowed array types are?
```diff
- const ROCTensorMap{T, S, N₁, N₂} = TensorMap{T, S, N₁, N₂, ROCVector{T, AMDGPU.Mem.HIPBuffer}}
+ const ROCTensorMap{T, S, N₁, N₂, A <: DenseRocVector{T}} = TensorMap{T, S, N₁, N₂, A}
```
```julia
function ROCTensorMap{T}(::UndefInitializer, V::TensorMapSpace{S, N₁, N₂}) where {T, S, N₁, N₂}
    return ROCTensorMap{T, S, N₁, N₂}(undef, V)
end
```
Thinking out loud here:
Would it also work if we rewrite the undef constructors logic in tensors.jl in a pattern like the following, using an additional intermediate step that selects the storage type, rather than doing that all in one go?
```julia
const _TensorMap{T, A <: DenseVector{T}, S, N1, N2} = TensorMap{T, S, N1, N2, A}
const DefaultTensorData{T} = Vector{T}

TensorMap{T}(::UndefInitializer, V::TensorMapSpace) where {T} =
    _TensorMap{T, DefaultTensorData{T}}(undef, V)
_TensorMap{T, A}(::UndefInitializer, V::TensorMapSpace) where {T, A} =
    TensorMap{T, spacetype(V), numout(V), numin(V), A}(undef, V)
```

and similar constructions for the other ones? This might avoid having to duplicate these constructors for AMD and CUDA, and could also pave the way for experiments with using `TensorMap{T, S, N1, N2, Memory{T}}`, which I'm guessing has very similar kinds of demands (and could be used to test the constructors etc. without having to rely on Buildkite?)
```julia
Alternatively, the domain and codomain can be specified by passing a [`HomSpace`](@ref)
using the syntax `codomain ← domain` or `domain → codomain`.
"""
function ROCTensorMap(
```
Double comment here:

- I think we probably also want `TensorMap(::SectorDict{I,<:ROCMatrix}, ...)` to return a `ROCTensorMap`, which I don't think is currently happening.
- We might also want `ROCTensorMap(::SectorDict{I,<:AbstractMatrix})` to automatically promote to `ROCVector`?

Given this, and my previous comment, would it make sense to alter this constructor to first attempt to figure out a type `A`, and then call `TensorMap{T,...,A}(data)`, which then enforces the data type?
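A sketch of what that two-step pattern could look like (entirely hypothetical: `storagetype_from_matrix` is a made-up placeholder for whatever mechanism ends up selecting the storage type `A`):

```julia
# Hypothetical two-step pattern: first determine a storage type A from the data,
# then delegate to the fully parameterized constructor, which enforces that type.
function TensorMap(data::AbstractDict{<:Sector, <:AbstractMatrix}, V::TensorMapSpace{S, N₁, N₂}) where {S, N₁, N₂}
    M = typeof(first(values(data)))   # simplistic; a real version should promote across all blocks
    A = storagetype_from_matrix(M)    # placeholder hook: matrix type -> DenseVector storage type
    return TensorMap{eltype(A), S, N₁, N₂, A}(data, V)
end
```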
```julia
using the syntax `codomain ← domain` or `domain → codomain`.
"""
function ROCTensorMap(
    data::AbstractDict{<:Sector, <:ROCArray},
```
```diff
- data::AbstractDict{<:Sector, <:ROCArray},
+ data::AbstractDict{<:Sector, <:ROCMatrix},
```
I agree, but I think we want to alter the TensorMap(data::AbstractDict{<:Sector, <:AbstractMatrix}, ...) for that, in such a way that we indeed first determine a suitable A<:DenseMatrix associated with the types in the values(data) list.
with the caveat that A <: DenseVector 😉
Potato potato
True, although it can be slightly tricky to find a generic implementation of a function that takes in `M <: AbstractMatrix` and spits out an `A <: AbstractVector` that corresponds to it, which is more or less why I brought it up.
Absolutely, I have been struggling with that before. This was just my funny Friday evening comment.
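One possible, admittedly fragile, sketch of such a mapping, relying on inference of `similar` (the helper name is made up, not an existing TensorKit function):

```julia
# Hypothetical helper: given a matrix type M, infer the corresponding 1-D storage type
# by asking what `similar` would return for a 1-D allocation of the same element type.
function vectorstoragetype(::Type{M}) where {M <: AbstractMatrix}
    T = eltype(M)
    return Core.Compiler.return_type(similar, Tuple{M, Type{T}, Tuple{Int}})
end

# vectorstoragetype(Matrix{Float64}) === Vector{Float64}; the hope would be that
# vectorstoragetype(ROCMatrix{Float64}) likewise yields the matching ROCVector type.
```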
```julia
        return copy!(tnew, t)
    end
end
```
Is this not covered by the generic implementation in TensorKit.jl/src/tensors/tensor.jl, lines 525 to 534 at a93edc6?

```julia
function Base.convert(
        TT::Type{TensorMap{T, S, N₁, N₂, A}}, t::AbstractTensorMap{<:Any, S, N₁, N₂}
    ) where {T, S, N₁, N₂, A}
    if typeof(t) === TT
        return t
    else
        tnew = TT(undef, space(t))
        return copy!(tnew, t)
    end
end
```
```julia
function Base.copy!(tdst::ROCTensorMap{T, S, N₁, N₂}, tsrc::ROCTensorMap{T, S, N₁, N₂}) where {T, S, N₁, N₂}
    space(tdst) == space(tsrc) || throw(SpaceMismatch("$(space(tdst)) ≠ $(space(tsrc))"))
    for ((c, bdst), (_, bsrc)) in zip(blocks(tdst), blocks(tsrc))
        copy!(bdst, bsrc)
    end
    return tdst
```
I'm guessing this is here to avoid going through the StridedView path which doesn't work on GPU? It might be reasonable to just remove the StridedView method, since this historically was there before everything was in a single large vector in TensorMap, and I would say this is a bit too opinionated for a generic AbstractTensorMap implementation.
Additionally, does this not coincide with the Base.copy!(::TensorMap, ::TensorMap) definition?
```julia
    return tdst
end

function Base.copy!(tdst::ROCTensorMap, tsrc::TensorKit.AdjointTensorMap)
```
Same comment about possibly just having this implementation as the `copy!(::AbstractTensorMap, ::AbstractTensorMap)` implementation.
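For concreteness, the generic method being suggested would presumably be the same block-wise loop with widened signatures, roughly:

```julia
# Sketch of the suggested generic fallback: block-wise copy! between any two tensor
# maps over the same space, without assuming anything about the underlying storage.
function Base.copy!(tdst::AbstractTensorMap, tsrc::AbstractTensorMap)
    space(tdst) == space(tsrc) || throw(SpaceMismatch("$(space(tdst)) ≠ $(space(tsrc))"))
    for ((c, bdst), (_, bsrc)) in zip(blocks(tdst), blocks(tsrc))
        copy!(bdst, bsrc)
    end
    return tdst
end
```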
```julia
# Conversion to ROCArray:
#----------------------
# probably not optimized for speed, only for checking purposes
function Base.convert(::Type{ROCArray}, t::AbstractTensorMap)
```
Can we rewrite the base version to take Type{A} where {A <: Array}, rather than hardcoding Array there?
Can try!
```julia
function TensorKit.Factorizations.MAK.initialize_output(::typeof(svd_vals!), t::ROCTensorMap, alg::AbstractAlgorithm)
    V_cod = infimum(fuse(codomain(t)), fuse(domain(t)))
    return ROCDiagonalTensorMap{real(scalartype(t))}(undef, V_cod)
```
We could consider introducing a `similar_diagonal(t, T)` function/hook in the base implementations, such that we could overload just that function and avoid having to copy the boilerplate here.
[edit] I hadn't seen that you already more or less did this for the base implementation, in which case these functions might not be necessary anymore?
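For reference, the kind of hook that was being proposed might have looked roughly like this (illustrative only; the extra space argument `V` is an assumption, since the output space is computed inside `initialize_output`):

```julia
# Illustrative sketch of the proposed hook: a single overload point for "allocate a
# diagonal tensor like t", so GPU extensions specialize this instead of copying boilerplate.
similar_diagonal(t::AbstractTensorMap, ::Type{T}, V) where {T} = DiagonalTensorMap{T}(undef, V)

# The AMD extension would then only need:
# similar_diagonal(t::ROCTensorMap, ::Type{T}, V) where {T} = ROCDiagonalTensorMap{T}(undef, V)
```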
#299 will be closed; it was easier to just create a new branch and cherry-pick.