
Conversation

@swamirishi
Contributor

@swamirishi swamirishi commented Dec 24, 2025

What changes were proposed in this pull request?

As per rocksdb, DirectSlices are supposed to expose direct buffers for interacting with rocksdb. The problem with DirectSlice when working with external byte buffers is that it does not understand the ByteBuffer's offset and limit. To make it safe to work with, adding a constructor to ManagedDirectSlice that fixes the offset and length of the underlying DirectSlice instance should be enough. The current implementation does not make it safe to reuse a direct slice multiple times, since rocksdb does not allow rolling back the offset when we want to use a different offset and limit with the same byte buffer. This change also makes the class easier to use in other places.
Once this patch in rocksdb gets merged, we can move away from using DirectSlices for writing into rocksdb altogether and use them only for reads:
facebook/rocksdb#14196
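The offset/limit fix can be illustrated with plain java.nio (the ManagedDirectSlice constructor itself lives in Ozone and is not shown here; this standalone sketch only demonstrates the kind of zero-copy view such a constructor would hand to rocksdb):

```java
import java.nio.ByteBuffer;

public class DirectViewDemo {
  /**
   * Returns a read-only, zero-copy view covering exactly the bytes between
   * the buffer's position and limit. slice() bakes the current position in
   * as offset 0 of the view, so the consumer needs no offset bookkeeping.
   */
  public static ByteBuffer fixedView(ByteBuffer buffer) {
    return buffer.slice().asReadOnlyBuffer();
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(16);
    buf.put(new byte[] {10, 20, 30, 40, 50, 60, 70, 80});
    buf.flip();
    buf.position(2); // pretend the first two bytes were already consumed
    ByteBuffer view = fixedView(buf);
    // The view starts at the original position and spans the remaining bytes.
    System.out.println(view.get(0) + " " + view.remaining()); // 30 6
  }
}
```

Because the view's offset and limit are fixed at construction time, reusing the original buffer at a different position requires a new view rather than mutating the old one, which is exactly the safety property discussed above.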

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14237

How was this patch tested?

Updated existing unit tests and removed irrelevant ones.

@swamirishi swamirishi force-pushed the HDDS-14237 branch 3 times, most recently from efd7932 to 41275d6 Compare December 24, 2025 19:30
@swamirishi swamirishi requested a review from szetszwo December 25, 2025 00:13
Change-Id: I1cc243d0dadbe647ad9dc7b94822ba7dc6a83e1c
public class ManagedDirectSlice extends ManagedObject<DirectSlice> {

private final ByteBuffer data;
public class ManagedDirectSlice extends DirectSlice {
Contributor
@swamirishi , What is the typical key size? Rocksdb javadoc says,

... When using smaller keys and values consider using org.rocksdb.Slice

See https://javadoc.io/static/org.rocksdb/rocksdbjni/10.4.2/org/rocksdb/DirectSlice.html

Contributor Author
We are using DirectSlices for cases where the data is already present in a direct byte buffer, to avoid an unnecessary copy from off-heap -> heap -> back to off-heap memory. When we get a CodecBuffer we are using direct byte buffers. We would reuse the same ByteBuffer even for put operations.

Contributor Author
An FSO key would typically be longer than 40 bytes, and an OBS key can be even longer, so we would see a significant improvement for such cases.

Contributor Author
@swamirishi swamirishi Dec 26, 2025
The code path for Slice and DirectSlice is the same on the native side: we end up initializing a Slice* pointer. The question then becomes whether we want to avoid these redundant buffer copies or not. DirectSlices will always be more optimal than Slice; it is just a matter of how large the performance bump is. For smaller keys it wouldn't be much, but for keys > 20 - 30 bytes it would be apparent, and for longer keys even more so.

Contributor Author
@swamirishi swamirishi Dec 26, 2025
Look at the code here:
DirectSlice initialization : https://github.com/facebook/rocksdb/blob/0bf9079d44eea91afda7151306d3a3439a39511b/java/rocksjni/slice.cc#L292-L308

Slice initialization:
https://github.com/facebook/rocksdb/blob/0bf9079d44eea91afda7151306d3a3439a39511b/java/rocksjni/slice.cc#L130-L147

If you look at the code, DirectSlice initialization just creates a slice object pointing at the already present direct ByteBuffer, with no copies. The Slice initialization, on the other hand, copies from a jbyte array into direct off-heap memory and then creates a Slice object. The two paths differ only in that buffer copy, so DirectSlice will always beat Slice on performance; by how much depends on the key length.
In our case, since we already have a direct CodecBuffer that was read from rocksdb, no copying is required. Even when we write entries into rocksdb, we have to copy to off-heap memory anyway, so we can avoid the intermediate byte array allocation and the wasted GC and CPU cycles.
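The difference described above boils down to one bulk copy per key. A rough pure-Java illustration of the two paths (the method names here are hypothetical, for illustration only, not rocksdb API):

```java
import java.nio.ByteBuffer;

public class CopyPathsDemo {
  // Slice-like path: the key arrives as a heap byte[], so it must be
  // copied into freshly allocated off-heap memory before native code
  // can address it. One allocation + one copy per key.
  public static ByteBuffer slicePath(byte[] key) {
    ByteBuffer offHeap = ByteBuffer.allocateDirect(key.length);
    offHeap.put(key);
    offHeap.flip();
    return offHeap;
  }

  // DirectSlice-like path: the key already lives in a direct buffer
  // (e.g. a CodecBuffer read back from rocksdb), so a zero-copy view
  // is enough; native code can use the buffer's address directly.
  public static ByteBuffer directSlicePath(ByteBuffer directKey) {
    return directKey.slice().asReadOnlyBuffer(); // no copy, no new off-heap block
  }
}
```

The per-key cost of `slicePath` grows with the key length (allocation plus memcpy), while `directSlicePath` is a constant-cost wrapper, which matches the argument that longer keys benefit more.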

Contributor
... but for keys > 20 - 30 bytes it would be apparent and for longer keys even more so.

How do you come up with > 20 - 30 bytes? It is unbelievable that 20 bytes versus 40 bytes could make a difference on modern computers.

Contributor Author
@swamirishi swamirishi Dec 26, 2025
This is the benchmark code I wrote:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.hdds.utils.db;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.RandomUtils;
import org.apache.hadoop.hdds.utils.db.managed.ManagedDirectSlice;
import org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils;
import org.apache.hadoop.hdds.utils.db.managed.ManagedSlice;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

public class TestRDBBatchOperation {

  @BeforeAll
  public static void loadLibrary() {
    ManagedRocksObjectUtils.loadRocksDBLibrary();
  }

  @ParameterizedTest
  @ValueSource(ints = {1, 10, 20, 40, 100, 1000})
  public void testManagedDirectSliceCreate(int numberOfBytesWritten) throws CodecException {
    int numberOfKeys = 1000000;
    byte[] randomBytes = RandomUtils.secure().nextBytes(numberOfBytesWritten);
    List<CodecBuffer> buffers = new ArrayList<>(numberOfKeys);
    for (int i = 0; i < numberOfKeys; i++) {
      buffers.add(CodecBufferCodec.get(true).fromPersistedFormat(randomBytes));
    }
    List<ManagedDirectSlice> slices = new ArrayList<>(numberOfKeys);
    long startTime = System.currentTimeMillis();
    for (CodecBuffer buffer : buffers) {
      slices.add(new ManagedDirectSlice(buffer.asReadOnlyByteBuffer()));
    }
    long totalTime = System.currentTimeMillis() - startTime;
    System.out.println("Time taken to create ManagedDirectSlice: " + totalTime + " ms with key length: " + numberOfBytesWritten);
    slices.forEach(ManagedDirectSlice::close);
    buffers.forEach(CodecBuffer::close);
  }

  @ParameterizedTest
  @ValueSource(ints = {1, 10, 20, 40, 100, 1000})
  public void testManagedSliceCreate(int numberOfBytesWritten) throws CodecException {
    int numberOfKeys = 1000000;
    byte[] randomBytes = RandomUtils.secure().nextBytes(numberOfBytesWritten);
    List<byte[]> buffers = new ArrayList<>(numberOfKeys);
    for (int i = 0; i < numberOfKeys; i++) {
      buffers.add(Arrays.copyOf(randomBytes, randomBytes.length));
    }
    List<ManagedSlice> slices = new ArrayList<>(numberOfKeys);
    long startTime = System.currentTimeMillis();
    for (byte[] buffer : buffers) {
      slices.add(new ManagedSlice(buffer));
    }
    long totalTime = System.currentTimeMillis() - startTime;
    System.out.println("Time taken to create Slice: " + totalTime + " ms with key length: " + numberOfBytesWritten);
    slices.forEach(ManagedSlice::close);
  }
}

Result for creating 1 million Slices and DirectSlices:

Time taken to create ManagedDirectSlice: 264 ms with key length: 1
Time taken to create ManagedDirectSlice: 170 ms with key length: 10
Time taken to create ManagedDirectSlice: 250 ms with key length: 20
Time taken to create ManagedDirectSlice: 227 ms with key length: 40
Time taken to create ManagedDirectSlice: 169 ms with key length: 100
Time taken to create ManagedDirectSlice: 270 ms with key length: 1000
Time taken to create Slice: 659 ms with key length: 1
Time taken to create Slice: 496 ms with key length: 10
Time taken to create Slice: 499 ms with key length: 20
Time taken to create Slice: 517 ms with key length: 40
Time taken to create Slice: 517 ms with key length: 100
Time taken to create Slice: 719 ms with key length: 1000

Contributor
Have you used -Xms? The heap keeps resizing.

Contributor
@szetszwo szetszwo Dec 26, 2025
  • numberOfKeys = 100_000
  • -Xms8G -Xmx8G
Time taken to create ManagedDirectSlice: 1985 ms with key length: 1
Time taken to create ManagedDirectSlice: 1934 ms with key length: 10
Time taken to create ManagedDirectSlice: 1965 ms with key length: 20
Time taken to create ManagedDirectSlice: 1926 ms with key length: 40
Time taken to create ManagedDirectSlice: 1949 ms with key length: 100
Time taken to create ManagedDirectSlice: 1952 ms with key length: 1000
Time taken to create Slice: 1894 ms with key length: 1
Time taken to create Slice: 1920 ms with key length: 10
Time taken to create Slice: 1979 ms with key length: 20
Time taken to create Slice: 1855 ms with key length: 40
Time taken to create Slice: 1913 ms with key length: 100
Time taken to create Slice: 1976 ms with key length: 1000

Contributor
@szetszwo szetszwo left a comment
@swamirishi , thanks for working on this! Please see the comment inlined.

Comment on lines 23 to 42
/**
* ManagedDirectSlice is a managed wrapper around the DirectSlice object. It ensures
* proper handling of native resources associated with DirectSlice, utilizing
* the ManagedObject infrastructure to prevent resource leaks. It works in tandem
* with a ByteBuffer, which acts as the data source for the managed slice.
* ManagedDirectSlice is a class that extends the {@link DirectSlice} class and provides additional
* management for slices of direct {@link ByteBuffer} memory. This class initializes the slice with
* the given ByteBuffer and sets its prefix and length properties based on the buffer's position
* and remaining capacity.
*
* The class is designed to handle specific memory slicing operations while ensuring that the
* provided ByteBuffer’s constraints are respected. ManagedDirectSlice leverages its parent
* {@link DirectSlice} functionalities to deliver optimized direct buffer handling.
*
* Constructor:
* - Initializes the ManagedDirectSlice instance with a provided ByteBuffer.
* - Sets the slice length to the buffer's remaining capacity.
* - Removes the prefix based on the buffer's position.
*
* This class overrides certain operations to tightly control the lifecycle and
* behavior of the DirectSlice it manages. It specifically caters to use cases
* where the slice is used in RocksDB operations, providing methods for safely
* interacting with the slice for put-like operations.
* NOTE: This class should be only with ByteBuffer whose position and limit is going be immutable in the lifetime of
* this ManagedDirectSlice instance. This means that the ByteBuffer's position and limit should not be modified
* externally while the ManagedDirectSlice is in use. The value in the byte buffer should be only accessed via the
* instance.
*/
Contributor
ManagedDirectSlice is quite empty. Remove the javadoc except for the following sentence.

/**
 * Managed {@link DirectSlice} for leak detection.
 */

Contributor Author
done

Comment on lines 46 to 49
super(data);
this.removePrefix(data.position());
this.setLength(data.remaining());
}
Contributor
Use slice(). (Please check if RocksDb is already calling slice().)

public class ManagedDirectSlice extends DirectSlice {
  // TODO Add leakTracker

  public ManagedDirectSlice(ByteBuffer buffer) {
    super(Objects.requireNonNull(buffer, "buffer == null").slice().asReadOnlyBuffer());
  }
}

Contributor Author
We would still need to set the length, since DirectSlice initialization just sets the pointer to the base address of the passed byte buffer but not the size of the slice.
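A small java.nio check of what slice() does and does not carry (the setLength call itself is rocksdb API and needs the native library; this only shows why the length is still available on the Java side to pass along):

```java
import java.nio.ByteBuffer;

public class SliceLengthDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(32);
    buf.position(4);
    buf.limit(20);
    ByteBuffer view = buf.slice();
    // slice() encodes the window in the Java object: position resets to 0
    // and capacity equals the original remaining() of 16. That capacity is
    // what a wrapper can hand to setLength(), since the native side only
    // records the buffer's base address.
    System.out.println(view.position() + " " + view.capacity()); // 0 16
  }
}
```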

Contributor Author
done

Contributor Author
BTW, do you think we need to create another ByteBuffer object via slice() unnecessarily? The ByteBuffer object would be forgotten anyway; only the offset and length are kept on the native side.

Contributor Author
@swamirishi swamirishi Dec 26, 2025
I would personally prefer the previous approach, which doesn't create a new object. The sliced buffer object would become eligible for GC right after the super constructor call.

Contributor
The slice buffer is just a wrapper; it does not copy the data. Are you also using Supplier, Function, Collection.stream()?

Contributor
BTW, it looks like removePrefix(..) is more expensive than slice().

Change-Id: Iecd9b24999262f3b57ab83b9f94499de88d912d2
Change-Id: Iff2fdf76de5c2d23091ca221b9300c9b80aa6f6b
Change-Id: Ica181c197cbde253de6611b5d70d77dd1c803b7b
Contributor
@szetszwo szetszwo left a comment
+1 the change looks good.

Just a very minor javadoc comment. We may address it in the new PR.

Comment on lines 33 to 35
* NOTE: This class should be only with ByteBuffer whose position and limit is going be immutable in the lifetime of
* this ManagedDirectSlice instance. This means that the ByteBuffer's position and limit should not be modified
* externally while the ManagedDirectSlice is in use.
Contributor
Since we are calling slice(), this javadoc can be removed.

Contributor Author
done

Change-Id: Iee329b8aad3b10bc3824d4500e8cf1f47d1881b9
@swamirishi swamirishi marked this pull request as ready for review December 27, 2025 01:12
@swamirishi swamirishi merged commit 3297198 into apache:master Dec 27, 2025
55 checks passed
@swamirishi
Contributor Author

Thank you @szetszwo for reviewing the patch

Comment on lines 38 to +39
@CsvSource({"0, 1024", "1024, 1024", "512, 1024", "0, 100", "10, 512", "0, 0"})
public void testManagedDirectSliceWithOffsetMovedAheadByteBuffer(int offset, int numberOfBytesWritten)
throws RocksDatabaseException {
public void testManagedDirectSliceWithOffset(int offset, int numberOfBytesWritten) {
Contributor
@swamirishi ,

  • Just found that the offset, which is not useful anymore, should be removed.
  • The hardcoded test cases are too simple: too few and all the sizes are power of 2. We should test some random sizes.

Filed HDDS-14259
