Feature request: option to return matched duplicate IDs in dbFindIdsUniqueTrials() #45

Open
machado-t opened this issue Feb 8, 2025 · 3 comments


machado-t commented Feb 8, 2025

Currently, dbFindIdsUniqueTrials() returns a vector of unique trial IDs after deduplication. However, it does not indicate which IDs were matched/merged into each ID during deduplication.

An option to return the matched duplicate IDs would allow users to compare and analyze data points from multiple registries for the same trial.

Example implementation:

Add an optional parameter (e.g., returnMatches) so that:

  • When returnMatches = FALSE (default): The function returns a vector of unique IDs (as it currently does).
  • When returnMatches = TRUE: The function returns a named list where each key is a unique trial ID and each value is the vector of all duplicate IDs merged into that unique trial. For example:
result <- list(
  "2024-512345-67-00" = c("2024-512345-67-00", "NCT01234567", "2023-002345-67"),
  "NCT01234568" = c("NCT01234568"),
  "2024-512346-78-00" = c("2024-512346-78-00", "2023-002356-89")
)
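
As a rough sketch of the two return shapes described above, the following base R code uses a hypothetical helper, formatUniqueIds(), and a hypothetical long-format data frame with the assumed columns unique_id and matched_id. None of these names exist in ctrdata; they only illustrate how a returnMatches switch could behave.

```r
# Hypothetical concordance in long format: one row per (unique ID, matched ID)
# pair. Column names are assumptions for illustration, not ctrdata internals.
concordance <- data.frame(
  unique_id = c("2024-512345-67-00", "2024-512345-67-00", "2024-512345-67-00",
                "NCT01234568",
                "2024-512346-78-00", "2024-512346-78-00"),
  matched_id = c("2024-512345-67-00", "NCT01234567", "2023-002345-67",
                 "NCT01234568",
                 "2024-512346-78-00", "2023-002356-89"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the proposed returnMatches switch
formatUniqueIds <- function(concordance, returnMatches = FALSE) {
  if (!returnMatches) {
    # Current behaviour: only the deduplicated IDs
    return(unique(concordance$unique_id))
  }
  # split() groups matched IDs by unique ID; explicit factor levels keep
  # the unique IDs in first-seen order instead of sorting them
  split(
    concordance$matched_id,
    factor(concordance$unique_id, levels = unique(concordance$unique_id))
  )
}

ids <- formatUniqueIds(concordance)
matches <- formatUniqueIds(concordance, returnMatches = TRUE)
```

Here ids is a plain character vector, while matches is a named list with the same structure as the example above.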

Thank you!

Owner

rfhb commented Feb 9, 2025

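Thanks @machado-t. The data frame outSet may contain the information you are looking for. It is created as an intermediate result in dbFindIdsUniqueTrials(), at the step commented # find duplicates, and looks like this, with one row representing one trial:

[Image: screenshot of the outSet data frame]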

This could be cleaned up (e.g., SPONSOR includes the protocol code) and returned; possibly after refactoring of dbFindIdsUniqueTrials() to pull out the function generating this concordance table.

Please, what are the use cases? Would such a tabular format be alright, or would a list such as above be needed, and if yes, how would the name of the list items be specified by the user?

Author

machado-t commented

Thank you very much! I managed to get what I needed using outSet via the debug function, so please don't worry too much about implementing this.

> Please, what are the use cases?

What I am trying to solve relates to an old database from multiple registries (ICTRP). I need to link it with a dataset collected with ctrdata, which is only possible if I know which trial IDs were merged.
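
A minimal sketch of this kind of linkage, assuming a concordance table is available in long format: the data frames and column names below (unique_id, matched_id, trial_id, legacy_value) are made up for illustration and are not ctrdata's or ICTRP's actual structure. Only base R's merge() is used.

```r
# Hypothetical concordance: which registry IDs were merged into which unique ID
concordance <- data.frame(
  unique_id  = c("2024-512345-67-00", "2024-512345-67-00", "NCT01234568"),
  matched_id = c("2024-512345-67-00", "NCT01234567", "NCT01234568"),
  stringsAsFactors = FALSE
)

# Hypothetical external records, keyed by whichever ID the older database stored
external <- data.frame(
  trial_id     = c("NCT01234567", "NCT01234568", "NCT09999999"),
  legacy_value = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Left join: every external record is kept; unique_id is NA where the
# external ID does not appear among the matched IDs
linked <- merge(
  external, concordance,
  by.x = "trial_id", by.y = "matched_id",
  all.x = TRUE
)
```

The resulting unique_id column can then serve as the key for joining to the deduplicated ctrdata data set.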

Knowing which trial IDs were matched also enables other use cases. For example, CTGOV2 provides minimum and maximum age eligibility limits, while CTIS only offers age groups. Since different registries may have overlapping or unique data, having all IDs allows you to compare the sources or select the one that is best suited.

> Would such a tabular format be alright, or would a list such as above be needed

A tabular format can be even more useful as it preserves information on the source registry.

> and if yes, how would the name of the list items be specified by the user?

In case you implement a list instead, the names could follow the same logic as dbFindIdsUniqueTrials() currently does, using preferregister, if I understand your question correctly.
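
To illustrate the naming idea, here is a sketch that derives the list names from a register preference order. The helper asNamedList(), the columns trial, id and register, and the assignment of the example IDs to registers are all assumptions made for this illustration; only the parameter name preferregister is taken from dbFindIdsUniqueTrials().

```r
# Hypothetical tabular concordance; "trial" is an arbitrary grouping key
concordance <- data.frame(
  trial    = c(1, 1, 1, 2, 3, 3),
  id       = c("2024-512345-67-00", "NCT01234567", "2023-002345-67",
               "NCT01234568", "2024-512346-78-00", "2023-002356-89"),
  register = c("CTIS", "CTGOV2", "EUCTR", "CTGOV2", "CTIS", "EUCTR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name each group by the ID from the most preferred register
asNamedList <- function(concordance, preferregister) {
  # Position of each row's register in the preference order (1 = most preferred)
  pref <- match(concordance$register, preferregister)
  ordered <- concordance[order(concordance$trial, pref), ]
  # Within each trial, IDs are now sorted from most to least preferred register
  groups <- split(ordered$id, ordered$trial)
  names(groups) <- vapply(groups, function(ids) ids[1], character(1),
                          USE.NAMES = FALSE)
  groups
}

byCtis  <- asNamedList(concordance, c("CTIS", "CTGOV2", "EUCTR"))
byCtgov <- asNamedList(concordance, c("CTGOV2", "EUCTR", "CTIS"))
```

With this approach, changing the preference order changes only the names and the order of IDs within each list item, not which IDs are grouped together.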

Owner

rfhb commented Feb 10, 2025

Thanks for explaining and glad you were not blocked.

It is not yet clear how in general to advance ctrdata for ICTRP. I received no reply when I asked mid-2023 for access (which seems to be called crawling there).

Perhaps there should be a dedicated function to generate such a concordance table of identifiers, e.g. dbGetConcordantIds(). This function would then be called by dbFindIdsUniqueTrials(). Overall, this could be efficient.
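
The split of responsibilities could look roughly like the sketch below. Both functions are hypothetical stand-ins: getConcordantIdsSketch() plays the role of the proposed dbGetConcordantIds() but only returns a fixed example table, and findUniqueIdsSketch() plays the role of dbFindIdsUniqueTrials(). The wide table layout and the register assignment of the example IDs are assumptions, not the actual structure of outSet.

```r
# Hypothetical stand-in for the proposed dbGetConcordantIds(): returns a fixed
# wide table with one row per trial and one column per register
getConcordantIdsSketch <- function() {
  data.frame(
    CTIS   = c("2024-512345-67-00", NA, "2024-512346-78-00"),
    CTGOV2 = c("NCT01234567", "NCT01234568", NA),
    EUCTR  = c("2023-002345-67", NA, "2023-002356-89"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper in the role of dbFindIdsUniqueTrials(): for each trial,
# keep the first available ID in the order of register preference
findUniqueIdsSketch <- function(preferregister) {
  concordance <- getConcordantIdsSketch()
  vapply(seq_len(nrow(concordance)), function(i) {
    ids <- unlist(concordance[i, preferregister], use.names = FALSE)
    ids[!is.na(ids)][1]
  }, character(1))
}

uniqueIds <- findUniqueIdsSketch(c("CTGOV2", "EUCTR", "CTIS"))
```

Users who need the full concordance would call the table-generating function directly, while the existing function keeps returning a plain vector.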
