-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ecarton/cumulus 3751 s3 task #3910
base: feature/CUMULUS-3751
Are you sure you want to change the base?
Conversation
…ecarton/CUMULUS-3751-as-separate-task
…umulus into CUMULUS-3757-move-granule
…ecarton/CUMULUS-3751-as-separate-task
… AL2023 (#3870) * Updated user-data for compatibility with Amazon Linux 2023 AMI * use amazon 2023 ami * install lvm2 * fix task-reaper * add more log for debug * update changelog * test backward compatibility image_id_ecs_amz2 * change back to /ngap/amis/image_id_ecs_al2023_x86 * update ecs user data use IMDSv2 * update fakeprovider IMDSv2 template aws cli v2 * update fake-provider --------- Co-authored-by: mikedorfman <[email protected]>
adding cbanh CI stack
* Update migration to remove vacuum statements that can time out in a Lambda env * Remove additional vacuum statements * Add CL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding questions from cmr-utils! Thanks for your patience.
chunkSize?: number | ||
} | ||
): Promise<DeleteObjectCommandOutput> => { | ||
const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider: Should we have a success path test around this method working aside the wrapper implementation?
@@ -560,7 +565,6 @@ function constructRelatedUrls({ | |||
files, | |||
distEndpoint, | |||
bucketTypes, | |||
s3CredsEndpoint = 's3credentials', | |||
cmrGranuleUrlType = 'both', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given s3CredsEndpoint isn't used anywhere else in this file, we should probably just define it in the function scope rather than create a top-level variable.
packages/cmrjs/src/cmr-utils.js
Outdated
const useDirectS3Type = shouldUseDirectS3Type(metadataObject); | ||
|
||
const newURLs = constructRelatedUrls({ | ||
updateUMMGMetadataObject({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably not great for maintenance/clarity that this method mutates metadataObject
- can we refactor it to not do that?
Meaning, updateUMMGMetadataObject
should require a return from the method to get the updated object, not change something defined in the scope of the parent. I'd expect a clone, or expansion assignment or something like that in the child method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
@@ -749,7 +782,7 @@ async function updateUMMGMetadata({ | |||
* @param {string} cmrConfig.certificate - Launchpad certificate | |||
* @param {string} cmrConfig.username - EDL username | |||
* @param {string} cmrConfig.passwordSecretName - CMR password secret name | |||
* @returns {Promise<Object>} object to create CMR instance - contains the | |||
* @returns {Promise<CMRConstructorParams>} object to create CMR instance - contains the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
packages/cmrjs/src/cmr-utils.js
Outdated
@@ -691,6 +695,43 @@ function shouldUseDirectS3Type(metadataObject) { | |||
return false; | |||
} | |||
|
|||
/** | |||
* | |||
* @param {Object} params |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define these params a bit better re: typing? Not asking for metadataObject, but bucketTypes/BucketMap/files might be reasonable given this is exported.
@@ -848,31 +881,24 @@ function buildMergedEchoURLObject(URLlist = [], originalURLlist = [], removedURL | |||
} | |||
|
|||
/** | |||
* After files are moved, creates new online access URLs and then updates | |||
* the S3 ECHO10 CMR XML file with this information. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: Docstring looks good, but should probably have header text
packages/cmrjs/src/cmr-utils.js
Outdated
* @param {string} attributePath - subObject path to seek | ||
* @returns {string | null} | ||
*/ | ||
const findCollectionAttributePath = (cmrObject, attributePath) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we need to recurse through the entire object? Is collection not a well defined key in the CMR spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also.... if it isn't a well defined location in the spec, are we certain there can't be duplicates/etc?
packages/cmrjs/src/cmr-utils.js
Outdated
) => { | ||
const backupPath = defaultPath || identifierPath; | ||
const fullPath = findCollectionAttributePath(cmrObject, identifierPath) || backupPath; | ||
set(cmrObject, fullPath, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably avoid mutation here, or consider not abstracting this?
I'm less concerned about the implementation as stands, but that this isn't by convention/etc something that someone would think twice about importing elsewhere at some point in the future.
t.is(updated.Granule.WhyThisAttribute[1].CollectionReference.Version, 'b'); | ||
}); | ||
|
||
test('updateCmrFileCollections updates umm when missing', (t) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is adding a Collection.Shortname if it doesn't exist to the user metadata?
Can we document/elaborate/discuss why that choice was made?
t.is(updated.CollectionReference.Version, 'b'); | ||
}); | ||
|
||
test('updateCmrFileCollections updates umm at non-standard locations', (t) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes this a non-standard location? Is it possible for it to be defined in both places? Same question for UMM as well.
} | ||
} | ||
}, | ||
"oldGranules": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's still an open question re: this output with respect to what we need downstream, but that can probably be modified as part of that work if needed. Commenting mostly as a bookmark to have that conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review still in progress, further commentary
import { ApiFile, ApiGranuleRecord } from '@cumulus/types'; | ||
import { AssertionError } from 'assert'; | ||
|
||
export type ValidApiFile = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Is this a legit type for the types package? Should it be?
} & ApiFile; | ||
|
||
export type ValidGranuleRecord = { | ||
files: Omit<ValidApiFile, 'granuleId'>[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This also raises a question of if there should be a type that doesn't include granuleId given we're doing this sort of omission both in types/granules and here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making this a defined type to DRY up the Omits in the task
return true; | ||
} | ||
|
||
export function apiGranuleRecordIsValid(granule: ApiGranuleRecord): granule is ValidGranuleRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: So... we're saying that granules must have files to be valid for the methods using this type.... saving this thought as I haven't completed the review, but what if there are collections being moved where there are some granules that don't have files? Is that a valid case? Added nit tag as I'm unconvinced this is a useful edge case in practice.
export const uploadCMRFile = async (cmrFile: Omit<ValidApiFile, 'granuleId'>, cmrObject: { Granule?: object }) => { | ||
let cmrFileString; | ||
if (isUMMGFilename(cmrFile.fileName || cmrFile.key)) { | ||
cmrFileString = JSON.stringify(cmrObject, undefined, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why '2' here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throughout our code, when we upload json, we upload it as a prettified string at 2 indentation
cmrFileString = JSON.stringify(cmrObject, undefined, 2); | ||
} else { | ||
// our xml stringify function packages the metadata in "Granule", | ||
// resulting in possible nested Granule object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you talk me through this a bit? I see generateEcho10XMLString putting values in Object.Granule
, but is `{ Granule?: object } the typing in the method because it could be UMMG or Echo10XML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also: Direct unit coverage here might be helpful in documenting the edge case?)
import { getRequiredEnvVar } from '@cumulus/common/env'; | ||
import { fetchDistributionBucketMap } from '@cumulus/distribution-utils'; | ||
import { AssertionError } from 'assert'; | ||
import { apiGranuleRecordIsValid, isCMRMetadataFile, updateCmrFileCollections, uploadCMRFile, ValidApiFile, ValidGranuleRecord } from './update_cmr_file_collection'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: line length, this should be line-split
import { log } from '@cumulus/common'; | ||
import { CollectionRecord } from '@cumulus/types'; | ||
import { CumulusMessage } from '@cumulus/types/message'; | ||
import { CollectionFile } from '@cumulus/types'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix multiple imports from the same package
granulesToCmrFileObjects, | ||
metadataObjectFromCMRFile, | ||
} from '@cumulus/cmrjs'; | ||
import { BucketsConfig } from '@cumulus/common'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix multiple imports from the same package
import keyBy from 'lodash/keyBy'; | ||
import cloneDeep from 'lodash/cloneDeep'; | ||
import zip from 'lodash/zip'; | ||
// eslint-disable-next-line lodash/import-scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can we better organize these imports (e.g. @cumulus
imports should be paragraphed together, lodash, non-project imports before @cumulus
, etc)?
} | ||
|
||
/** | ||
* Is this move a "real" move, or is target location identical to source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rephrase into 'actionable' phrase, e.g. 'boolean test that determines X'
import { AssertionError } from 'assert'; | ||
import { apiGranuleRecordIsValid, isCMRMetadataFile, updateCmrFileCollections, uploadCMRFile, ValidApiFile, ValidGranuleRecord } from './update_cmr_file_collection'; | ||
|
||
const MB = 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I see we have the pattern elsewhere, but low-key not a fan of replicating this definition everywhere. We should consider if this is a reasonable case for a common definition/export, but not required in this PR.
|
||
const MB = 1024 * 1024; | ||
|
||
type EventConfig = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintenance/convention: Consider extracting these to a types.ts file
} | ||
}; | ||
|
||
function getConcurrency() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this should be a method over a top-scope constant?
/** | ||
* Is this move a "real" move, or is target location identical to source | ||
*/ | ||
function moveRequested( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it should be named differently, something like s3BucketKeyMatch
or something clever involving an identitiy property, or ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps objectSourceAndTargetSame?
return false; | ||
} | ||
|
||
const targetExists = await S3.s3ObjectExists({ Bucket: targetFile.bucket, Key: targetFile.key }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually a safe operation with respect to data management? What if the new collection is misconfigured and the first run has file collisions?
I grok our workflow design is intended to be idempotent, but in reading through the code think this highlights the needs for/ warrants task/feature documentation -- in the case of a collision a granule could wind up with co-mingled files with another granule if the collection regular expression isn't carefully constructed.
edit with the caveat that I'm still reviewing through the file, the answer may be apparent in the ordering of these functions /edit
return false; | ||
} | ||
|
||
const sourceExists = await S3.s3ObjectExists({ Bucket: sourceFile.bucket, Key: sourceFile.key }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: these could probably be run in parallel for efficiency reasons.
} | ||
|
||
/** | ||
* Validates the file matched only one collection and has a valid bucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: collection configuration regular expression
cmrObjects: { [key: string]: Object }, | ||
s3MultipartChunksizeMb?: number, | ||
}): Promise<void> { | ||
await pMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this also have retry behavior?
zip(sourceGranules, targetGranules), | ||
async ([sourceGranule, targetGranule]) => { | ||
if (!sourceGranule?.files || !targetGranule?.files) { | ||
throw new AssertionError({ message: 'size mismatch between target and source granules' }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this error may not be descriptive of the issue, meaning 'mismatch' doesn't have meaning unless you say target and source granule arrays, and also that it's possible this error occurs not because of a size mismatch, but because a granule record has no files.
} | ||
})); | ||
}, | ||
{ concurrency: getConcurrency() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - concurrency is per granule, but not per file? Is that acceptable in a case where granules have 1000 files? 10k? etc.
Co-authored-by: Jonathan Kovarik <[email protected]>
Co-authored-by: Jonathan Kovarik <[email protected]>
Summary: 3751 just the s3 copy part
Addresses CUMULUS-3751: Move granules across collections
Changes
PR Checklist