generated from amazon-archives/__template_Apache-2.0
Merge latest changes from main to 'Documentation' branch #192
Open
rsareddy0329 wants to merge 161 commits into documentation from main
Co-authored-by: adishaa <adishaa@amazon.com>
… with minor improvements and bug fixes (#137)
… with minor improvements and bug fixes. (#139)
…ception count data (#140)
* manual release v3.0.1
… regionalized HMA URI (#141)
* Add unique time string to integ test * Update syntax
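One way to read "unique time string" is a timestamp suffix appended to each test resource name so that repeated or concurrent integration runs do not collide. The sketch below is an illustration under that assumption; the helper name `unique_test_name` and the timestamp layout are hypothetical and not taken from the repository.

```python
from datetime import datetime, timezone


def unique_test_name(prefix: str) -> str:
    """Return `prefix` plus a UTC timestamp with microsecond resolution.

    Hypothetical helper: the name and the timestamp layout are assumptions
    made for illustration. Only digits and hyphens are appended, so a
    lowercase prefix yields a name that stays valid as a Kubernetes
    resource name.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S-%f")
    return f"{prefix}-{stamp}"


if __name__ == "__main__":
    # Example: name a training job created by an integration test.
    print(unique_test_name("integ-job"))
```

A timestamp keeps names sortable and easy to trace in logs; a random suffix would be the alternative if two names could be requested within the same microsecond.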
* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Update inference SDK examples * Update readme
* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update inference config and integ tests * Update integ tests for new canaries
* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally
…8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes.
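The check described in this commit can be sketched as follows. This is not the code from the PR: it assumes the rule being enforced is the Kubernetes version skew policy for kubectl (client within one minor version of the API server), and the names `K8S_VERSION_PATTERN`, `parse_major_minor`, and `is_client_compatible` are hypothetical.

```python
import re

# Hypothetical module-level constant, in the spirit of "move regex to a
# constant". It matches the leading major.minor of strings such as
# "v1.29.3", "1.30+" or "v1.30.1-eks-abc123".
K8S_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = K8S_VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_client_compatible(
    client_version: str, server_version: str, max_minor_skew: int = 1
) -> bool:
    """Return True if client and server share a major version and their
    minor versions differ by at most `max_minor_skew`.

    The default of 1 is an assumption based on the kubectl skew policy.
    """
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    return (
        client_major == server_major
        and abs(client_minor - server_minor) <= max_minor_skew
    )


if __name__ == "__main__":
    print(is_client_compatible("v1.29.3", "v1.30.1-eks-abc123"))
    print(is_client_compatible("v1.27.0", "v1.30.0"))
```

Patch versions and vendor suffixes are ignored here because the skew policy is stated in terms of minor versions.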
* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries
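A fixed wait before refreshing a job's status is often combined with polling until the expected state appears. The sketch below shows that general pattern only; it does not reproduce the test code from this commit, and `wait_for_status` together with all of its parameters is a hypothetical helper.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    expected: str,
    timeout_s: float = 60.0,
    interval_s: float = 5.0,
    initial_wait_s: float = 0.0,
) -> str:
    """Poll `get_status` until it returns `expected` or the timeout expires.

    Hypothetical helper for illustration. `initial_wait_s` models the
    "wait time before refresh": a pause before the first status read.
    """
    time.sleep(initial_wait_s)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Status is {status!r}; expected {expected!r} "
                f"within {timeout_s} seconds"
            )
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated status source standing in for a real job refresh call.
    statuses = iter(["Pending", "Pending", "Running"])
    print(wait_for_status(lambda: next(statuses), "Running", interval_s=0.01))
```

Polling with a deadline tends to be less flaky than a single long sleep, since the test proceeds as soon as the state is reached and fails with a clear message otherwise.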
…189) Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update documentation-with-new-changes branch with latest changes from main (#190) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com> * Documentation Fixes (#191) Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * update documentation with new changes branch with latest changes (#194) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com> * Documentation Fixes (#195) * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation Fixes (#197) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation Fixes (#198) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation fixes (#199) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: 
pintaoz <pintaoz@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin
…holder value (#206) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* Implement validation for mig profiles when creating/updating spaces * Update Space parameter model * Make Space Template namespaced resource
* Update Space Template CLI to be namespaced * Space get-logs default to the workspace container * Remove error handling to bubble up the actual K8s errors * Listing public Spaces * Fix typos, elaborated text, add logic to parse idle-shutdown
Inference tests succeeded with parker-cli code - https://quip-amazon.com/fhwhAAMht0Mm/Project-Parker-HyperPod-User-Experience-for-Data-Scientist-persona
    Parker-cli integ tests pass (shown below)
    These inference tests failing are known to be flaky - https://w.amazon.com/bin/view/AWS/AmazonAI/Platform/Codex/CodexInfra/Runbooks/HyperPodCLI/TroubleshootInferenceTests#HTroubleshooting
    ticket has been created to fix these flaky tests - https://t.corp.amazon.com/V1943878058
    Parker-cli integ tests passing
    ============================= test session starts ==============================
    platform linux -- Python 3.11.14, pytest-8.3.2, pluggy-1.6.0 -- /root/.pyenv/versions/3.11.14/bin/python3.11
    cachedir: .pytest_cache
    rootdir: /codebuild/output/src1458832038/src/github.com/aws/private-sagemaker-hyperpod-cli-staging
    configfile: setup.cfg
    plugins: hydra-core-1.3.2, order-1.3.0, dependency-0.6.0, cov-5.0.0
    collecting ... collected 39 items
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_create PASSED [ 2%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_table PASSED [ 5%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_json PASSED [ 7%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_yaml PASSED [ 10%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_json PASSED [ 12%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_stop PASSED [ 15%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_start PASSED [ 17%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_update PASSED [ 20%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_get_logs PASSED [ 23%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete PASSED [ 25%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_empty_namespace PASSED [ 28%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_nonexistent PASSED [ 30%]
    test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete_nonexistent PASSED [ 33%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_create PASSED [ 35%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_table PASSED [ 38%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_json PASSED [ 41%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_yaml PASSED [ 43%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_json PASSED [ 46%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_update PASSED [ 48%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete PASSED [ 51%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_empty_namespace PASSED [ 53%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_nonexistent PASSED [ 56%]
    test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete_nonexistent PASSED [ 58%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_create_space PASSED [ 61%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_list_spaces PASSED [ 64%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_get_space PASSED [ 66%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_wait_until_running PASSED [ 69%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_update_space PASSED [ 71%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_stop_space PASSED [ 74%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_start_space PASSED [ 76%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_list_pods PASSED [ 79%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_get_logs PASSED [ 82%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_create_space_access SKIPPED [ 84%]
    test/integration_tests/space/sdk/test_sdk_space.py::test_delete_space PASSED [ 87%]
    test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_create_template PASSED [ 89%]
    test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_list_templates PASSED [ 92%]
    test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_get_template PASSED [ 94%]
    test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_update_template PASSED [ 97%]
    test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_delete_template PASSED [100%]
    =============================== warnings summary ===============================
* Update README for fractional gpu support * update pytorch job example * add example for accelerator partitions
* feat: Implement elastic training cli arguments (#273) * feat: Implement elastic training cli arguments * Add elastic training unified config and unit test * Add graceful shutdown and scaling timeout to cli args * Revert "feat: Implement elastic training cli arguments (#273)" This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259. * Add dev_space_constants.py (#255) Co-authored-by: Brian Xia <brianxia@amazon.com> * Add dev_space_access_constants.py (#256) Co-authored-by: Brian Xia <brianxia@amazon.com> * Add space_admin_config_constants.py (#257) Co-authored-by: Brian Xia <brianxia@amazon.com> * Add template package only (#261) Co-authored-by: Brian Xia <brianxia@amazon.com> * Add dev_space.py CLI command (#263) * Add dev_space.py CLI command * Add dev space unit tests --------- Co-authored-by: Brian Xia <brianxia@amazon.com> * Add dev_space_utils.py to work with the dev space template model (#262) * Add dev_space_utils.py * Add unit tests for dev_space_utils --------- Co-authored-by: Brian Xia <brianxia@amazon.com> * Add dev space CLI (#269) * Rename dev space to space (#272) * Update the Space model and constants per latest operator (#275) * Add space_admin_config.py CLI command (#260) * Add space_admin_config.py CLI command * Update the space admin config to space template --------- Co-authored-by: Brian Xia <brianxia@amazon.com> * Implement CRUD operations for Space PySDK (#267) * Implement CRUD operations for Space PySDK * Update Space PySDK per new schema * Update Space PySDK per new schema * Implement the pySDK for the Space Template (#282) * Refactor Space CLI using the Space PySDK (#281) * Implement CRUD operations for Space PySDK * Update Space PySDK per new schema * Refactor CLI to use the PySDK * Add dev_space_access.py CLI command (#259) * Add dev_space_access.py CLI command * Add space access creation to pySDK and refactor space access CLI --------- Co-authored-by: Brian Xia <brianxia@amazon.com> * Listing space will filter out the spaces not 
created by the current user (#285) * Implement CRUD operations for Space PySDK * Update Space PySDK per new schema * Implement CRUD operations for Space PySDK * Update Space PySDK per new schema * Update Space PySDK per new schema * Implement space list pagination and creator filtering * Refactor space template with PySDK (#286) * Add additional Space parameters for resources including the fractional GPU (#287) * Implement validation for mig profiles for Spaces (#291) * Implement validation for mig profiles when creating/updating spaces * Update Space parameter model * Make Space Template namespaced resource * Parker GA issues (#296) * Update Space Template CLI to be namespaced * Space get-logs default to the workspace container * Remove error handling to bubble up the actual K8s errors * Listing public Spaces * Fix typos, elaborated text, add logic to parse idle-shutdown * Fix the template ref regression (#300) * Update SageMaker Space documentation (#301) * Implement Space integration tests (#298) Inference tests succeeded with parker-cli code - https://quip-amazon.com/fhwhAAMht0Mm/Project-Parker-HyperPod-User-Experience-for-Data-Scientist-persona Parker-cli integ tests pass (shown below) These inference tests failing are known to be flaky- https://w.amazon.com/bin/view/AWS/AmazonAI/Platform/Codex/CodexInfra/Runbooks/HyperPodCLI/TroubleshootInferenceTests#HTroubleshooting ticket has been created to fix these flaky tests - https://t.corp.amazon.com/V1943878058 Parker-cli integ tests passing ============================= test session starts ============================== platform linux -- Python 3.11.14, pytest-8.3.2, pluggy-1.6.0 -- /root/.pyenv/versions/3.11.14/bin/python3.11 cachedir: .pytest_cache rootdir: /codebuild/output/src1458832038/src/github.com/aws/private-sagemaker-hyperpod-cli-staging configfile: setup.cfg plugins: hydra-core-1.3.2, order-1.3.0, dependency-0.6.0, cov-5.0.0 collecting ... 
collected 39 items test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_create PASSED [ 2%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_table PASSED [ 5%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_json PASSED [ 7%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_yaml PASSED [ 10%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_json PASSED [ 12%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_stop PASSED [ 15%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_start PASSED [ 17%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_update PASSED [ 20%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_get_logs PASSED [ 23%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete PASSED [ 25%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_empty_namespace PASSED [ 28%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_nonexistent PASSED [ 30%] test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete_nonexistent PASSED [ 33%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_create PASSED [ 35%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_table PASSED [ 38%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_json PASSED [ 41%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_yaml PASSED [ 43%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_json PASSED [ 46%] 
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_update PASSED [ 48%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete PASSED [ 51%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_empty_namespace PASSED [ 53%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_nonexistent PASSED [ 56%] test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete_nonexistent PASSED [ 58%] test/integration_tests/space/sdk/test_sdk_space.py::test_create_space PASSED [ 61%] test/integration_tests/space/sdk/test_sdk_space.py::test_list_spaces PASSED [ 64%] test/integration_tests/space/sdk/test_sdk_space.py::test_get_space PASSED [ 66%] test/integration_tests/space/sdk/test_sdk_space.py::test_wait_until_running PASSED [ 69%] test/integration_tests/space/sdk/test_sdk_space.py::test_update_space PASSED [ 71%] test/integration_tests/space/sdk/test_sdk_space.py::test_stop_space PASSED [ 74%] test/integration_tests/space/sdk/test_sdk_space.py::test_start_space PASSED [ 76%] test/integration_tests/space/sdk/test_sdk_space.py::test_list_pods PASSED [ 79%] test/integration_tests/space/sdk/test_sdk_space.py::test_get_logs PASSED [ 82%] test/integration_tests/space/sdk/test_sdk_space.py::test_create_space_access SKIPPED [ 84%] test/integration_tests/space/sdk/test_sdk_space.py::test_delete_space PASSED [ 87%] test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_create_template PASSED [ 89%] test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_list_templates PASSED [ 92%] test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_get_template PASSED [ 94%] 
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_update_template PASSED [ 97%] test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_delete_template PASSED [100%] =============================== warnings summary =============================== * merge conflicts fixed * Update README for fractional gpu support (#294) * Update README for fractional gpu support * update pytorch job example * add example for accelerator partitions * merge conflicts from js template and inference * update changelog * uncommented install req * uncommented * fixed uncomment --------- Co-authored-by: Sophia <yungwenh@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Brian Xia <brianfruitose@gmail.com> Co-authored-by: Brian Xia <brianxia@amazon.com> Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> Co-authored-by: Ophelia Yang <86372475+oyangz@users.noreply.github.com>
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* integration test for jumpstart with mig profile * template fix for mig with jumpstart * skipped mig tests until instances setup finished --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>
* set template-version flag to optional for cluster create, add support for efa for pytorch job, remove default request and limits when instance type is none * fix gpu allocation validation error * remove redundant * fix unit test and expand logic to memory and vcpu field * Follow up on merge conflict in release * consolidate all debug flags to show Kubernetes exception * Revert "Follow up on merge conflict in release" This reverts commit c816838. * fix unit and integ test for space * fix more unit test for space * change dependency for delete in init integ test
* integration test for jumpstart with mig profile * template fix for mig with jumpstart * skipped mig tests until instances setup finished * enable the mig integration tests --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>
* Upgrade Inference Operator Version (#327) * pyproj version update (#328) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> * version change (#329) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> * elastic training to keynote3 (#307) * feat: Implement elastic training cli arguments (#273) * feat: Implement elastic training cli arguments * Add elastic training unified config and unit test * Add graceful shutdown and scaling timeout to cli args * Revert "feat: Implement elastic training cli arguments (#273)" This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259. * feat: Implement elastic training cli arguments (#295) * feat: implement elastic training cli args * Rename args name to match crd for elastic training * Add unit test for replica discrete values * Add integ test for elastic training cli --------- Co-authored-by: Sophia <yungwenh@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> * version update for v3.5.0 --------- Co-authored-by: Shantanu Tripathi <shantanutripathi237@gmail.com> Co-authored-by: Mohamed Zeidan <81834882+mohamedzeidan2021@users.noreply.github.com> Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> Co-authored-by: Sophia <yungwenh@amazon.com>
* Update documentation for elastic training arguments * nit: Add detail descriptions for array type
PR Approval Steps
For Requester
For Reviewer
Refer to the For Requester section to double-check each item.