Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ecarton/cumulus 3751 s3 task #3910

Open
wants to merge 344 commits into
base: feature/CUMULUS-3751
Choose a base branch
from

Conversation

etcart
Copy link
Contributor

@etcart etcart commented Jan 28, 2025

Summary: 3751 just the s3 copy part
Addresses CUMULUS-3751: Move granules across collections

Changes

  • task for copying granules' s3 files from one collection to another
  • workflow and integration test for this s3 file copy, which is intended to be expanded to include the rest of the workflow as those are ready

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

etcart and others added 30 commits December 6, 2024 11:35
… AL2023 (#3870)

* Updated user-data for compatibility with Amazon Linux 2023 AMI

* use amazon 2023 ami

* install lvm2

* fix task-reaper

* add more log for debug

* update changelog

* test backward compatibility image_id_ecs_amz2

* change back to /ngap/amis/image_id_ecs_al2023_x86

* update ecs user data use IMDSv2

* update fakeprovider IMDSv2 template aws cli v2

* update fake-provider

---------

Co-authored-by: mikedorfman <[email protected]>
* Update migration to remove vacuum statements that can time out in a Lambda env

* Remove additional vacuum statements

* Add CL
Copy link
Member

@Jkovarik Jkovarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding questions from cmr-utils! Thanks for your patience.

chunkSize?: number
}
): Promise<DeleteObjectCommandOutput> => {
const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider: Should we have a success path test around this method working aside the wrapper implementation?

@@ -560,7 +565,6 @@ function constructRelatedUrls({
files,
distEndpoint,
bucketTypes,
s3CredsEndpoint = 's3credentials',
cmrGranuleUrlType = 'both',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given s3CredsEndpoint isn't used anywhere else in this file, we should probably just define it in the function scope rather than create a top-level variable.

const useDirectS3Type = shouldUseDirectS3Type(metadataObject);

const newURLs = constructRelatedUrls({
updateUMMGMetadataObject({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not great for maintenance/clarity that this method mutates metadataObject - can we refactor it to not do that?

Meaning, updateUMMGMetadataObject should require a return from the method to get the updated object, not change something defined in the scope of the parent. I'd expect a clone, or expansion assignment or something like that in the child method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@@ -749,7 +782,7 @@ async function updateUMMGMetadata({
* @param {string} cmrConfig.certificate - Launchpad certificate
* @param {string} cmrConfig.username - EDL username
* @param {string} cmrConfig.passwordSecretName - CMR password secret name
* @returns {Promise<Object>} object to create CMR instance - contains the
* @returns {Promise<CMRConstructorParams>} object to create CMR instance - contains the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -691,6 +695,43 @@ function shouldUseDirectS3Type(metadataObject) {
return false;
}

/**
*
* @param {Object} params
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define these params a bit better re: typing? Not asking for metadataObject, but bucketTypes/BucketMap/files might be reasonable given this is exported.

@@ -848,31 +881,24 @@ function buildMergedEchoURLObject(URLlist = [], originalURLlist = [], removedURL
}

/**
* After files are moved, creates new online access URLs and then updates
* the S3 ECHO10 CMR XML file with this information.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: Docstring looks good, but should probably have header text

* @param {string} attributePath - subObject path to seek
* @returns {string | null}
*/
const findCollectionAttributePath = (cmrObject, attributePath) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need to recurse through the entire object? Is collection not a well defined key in the CMR spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also.... if it isn't a well defined location in the spec, are we certain there can't be duplicates/etc?

) => {
const backupPath = defaultPath || identifierPath;
const fullPath = findCollectionAttributePath(cmrObject, identifierPath) || backupPath;
set(cmrObject, fullPath, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably avoid mutation here, or consider not abstracting this?

I'm less concerned about the implementation as stands, but that this isn't by convention/etc something that someone would think twice about importing elsewhere at some point in the future.

t.is(updated.Granule.WhyThisAttribute[1].CollectionReference.Version, 'b');
});

test('updateCmrFileCollections updates umm when missing', (t) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is adding a Collection.Shortname if it doesn't exist to the user metadata?

Can we document/elaborate/discuss why that choice was made?

t.is(updated.CollectionReference.Version, 'b');
});

test('updateCmrFileCollections updates umm at non-standard locations', (t) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes this a non-standard location? Is it possible for it to be defined in both places? Same question for UMM as well.

}
}
},
"oldGranules": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's still an open question re: this output with respect to what we need downstream, but that can probably be modified as part of that work if needed. Commenting mostly as a bookmark to have that conversation.

Copy link
Member

@Jkovarik Jkovarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review still in progress, further commentary

import { ApiFile, ApiGranuleRecord } from '@cumulus/types';
import { AssertionError } from 'assert';

export type ValidApiFile = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is this a legit type for the types package? Should it be?

} & ApiFile;

export type ValidGranuleRecord = {
files: Omit<ValidApiFile, 'granuleId'>[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This also raises a question of if there should be a type that doesn't include granuleId given we're doing this sort of omission both in types/granules and here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this a defined type to DRY up the Omits in the task

return true;
}

export function apiGranuleRecordIsValid(granule: ApiGranuleRecord): granule is ValidGranuleRecord {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: So... we're saying that granules must have files to be valid for the methods using this type.... saving this thought as I haven't completed the review, but what if there are collections being moved where there are some granules that don't have files? Is that a valid case? Added nit tag as I'm unconvinced this is a useful edge case in practice.

export const uploadCMRFile = async (cmrFile: Omit<ValidApiFile, 'granuleId'>, cmrObject: { Granule?: object }) => {
let cmrFileString;
if (isUMMGFilename(cmrFile.fileName || cmrFile.key)) {
cmrFileString = JSON.stringify(cmrObject, undefined, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why '2' here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throughout our code, when we upload json, we upload it as a prettified string at 2 indentation

cmrFileString = JSON.stringify(cmrObject, undefined, 2);
} else {
// our xml stringify function packages the metadata in "Granule",
// resulting in possible nested Granule object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk me through this a bit? I see generateEcho10XMLString putting values in Object.Granule, but is `{ Granule?: object } the typing in the method because it could be UMMG or Echo10XML?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also: Direct unit coverage here might be helpful in documenting the edge case?)

import { getRequiredEnvVar } from '@cumulus/common/env';
import { fetchDistributionBucketMap } from '@cumulus/distribution-utils';
import { AssertionError } from 'assert';
import { apiGranuleRecordIsValid, isCMRMetadataFile, updateCmrFileCollections, uploadCMRFile, ValidApiFile, ValidGranuleRecord } from './update_cmr_file_collection';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: line length, this should be line-split

import { log } from '@cumulus/common';
import { CollectionRecord } from '@cumulus/types';
import { CumulusMessage } from '@cumulus/types/message';
import { CollectionFile } from '@cumulus/types';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix multiple imports from the same package

granulesToCmrFileObjects,
metadataObjectFromCMRFile,
} from '@cumulus/cmrjs';
import { BucketsConfig } from '@cumulus/common';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix multiple imports from the same package

import keyBy from 'lodash/keyBy';
import cloneDeep from 'lodash/cloneDeep';
import zip from 'lodash/zip';
// eslint-disable-next-line lodash/import-scope
Copy link
Member

@Jkovarik Jkovarik Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we better organize these imports (e.g. @cumulus imports should be paragraphed together, lodash, non-project imports before @cumulus, etc)?

}

/**
* Is this move a "real" move, or is target location identical to source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rephrase into 'actionable' phrase, e.g. 'boolean test that determines X'

import { AssertionError } from 'assert';
import { apiGranuleRecordIsValid, isCMRMetadataFile, updateCmrFileCollections, uploadCMRFile, ValidApiFile, ValidGranuleRecord } from './update_cmr_file_collection';

const MB = 1024 * 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I see we have the pattern elsewhere, but low-key not a fan of replicating this definition everywhere. We should consider if this is a reasonable case for a common definition/export, but not required in this PR.


const MB = 1024 * 1024;

type EventConfig = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintenance/convention: Consider extracting these to a types.ts file

}
};

function getConcurrency() {
Copy link
Member

@Jkovarik Jkovarik Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this should be a method over a top-scope constant?

/**
* Is this move a "real" move, or is target location identical to source
*/
function moveRequested(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it should be named differently, something like s3BucketKeyMatch or something clever involving an identitiy property, or ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps objectSourceAndTargetSame?

return false;
}

const targetExists = await S3.s3ObjectExists({ Bucket: targetFile.bucket, Key: targetFile.key });
Copy link
Member

@Jkovarik Jkovarik Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually a safe operation with respect to data management? What if the new collection is misconfigured and the first run has file collisions?

I grok our workflow design is intended to be idempotent, but in reading through the code think this highlights the needs for/ warrants task/feature documentation -- in the case of a collision a granule could wind up with co-mingled files with another granule if the collection regular expression isn't carefully constructed.

edit with the caveat that I'm still reviewing through the file, the answer may be apparent in the ordering of these functions /edit

return false;
}

const sourceExists = await S3.s3ObjectExists({ Bucket: sourceFile.bucket, Key: sourceFile.key });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: these could probably be run in parallel for efficiency reasons.

}

/**
* Validates the file matched only one collection and has a valid bucket
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: collection configuration regular expression

cmrObjects: { [key: string]: Object },
s3MultipartChunksizeMb?: number,
}): Promise<void> {
await pMap(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this also have retry behavior?

zip(sourceGranules, targetGranules),
async ([sourceGranule, targetGranule]) => {
if (!sourceGranule?.files || !targetGranule?.files) {
throw new AssertionError({ message: 'size mismatch between target and source granules' });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this error may not be descriptive of the issue, meaning 'mismatch' doesn't have meaning unless you say target and source granule arrays, and also that it's possible this error occurs not because of a size mismatch, but because a granule record has no files.

}
}));
},
{ concurrency: getConcurrency() }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - concurrency is per granule, but not per file? Is that acceptable in a case where granules have 1000 files? 10k? etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants