Feature request: option to return matched duplicate IDs in dbFindIdsUniqueTrials() #45

machado-t · 2025-02-08T19:02:10Z

Currently, dbFindIdsUniqueTrials() returns a vector of unique trial IDs after deduplication. However, it does not indicate which IDs were matched/merged into each ID during deduplication.

An option to return the matched duplicate IDs would allow users to compare and analyze data points from multiple registries for the same trial.

Example implementation:

Add an optional parameter (e.g., returnMatches) so that:

When returnMatches = FALSE (default): The function returns a vector of unique IDs (as it currently does).
When returnMatches = TRUE: The function returns a named list where each name is a unique trial ID and each value is the vector of all IDs (including the unique ID itself) that were merged into that unique trial. For example:

result <- list(
  "2024-512345-67-00" = c("2024-512345-67-00", "NCT01234567", "2023-002345-67"),
  "NCT01234568" = c("NCT01234568"),
  "2024-512346-78-00" = c("2024-512346-78-00", "2023-002356-89")
)
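
As a rough sketch of how such a list could be derived, assume the matches are available as a long-format table with one row per matched ID. The data frame `concordance` and its column names `unique_id` and `matched_id` are hypothetical and not part of ctrdata; base R's `split()` then yields the structure shown above.

```r
# Hypothetical long-format concordance; the column names are assumptions
# made for this illustration and are not taken from ctrdata.
concordance <- data.frame(
  unique_id = c(
    "2024-512345-67-00", "2024-512345-67-00", "2024-512345-67-00",
    "NCT01234568",
    "2024-512346-78-00", "2024-512346-78-00"
  ),
  matched_id = c(
    "2024-512345-67-00", "NCT01234567", "2023-002345-67",
    "NCT01234568",
    "2024-512346-78-00", "2023-002356-89"
  ),
  stringsAsFactors = FALSE
)

# Use a factor with explicit levels so that the list keeps the order in
# which the unique IDs first appear, instead of alphabetical order.
groups <- factor(concordance$unique_id, levels = unique(concordance$unique_id))

# Named list: one element per unique trial, holding all of its matched IDs.
result <- split(concordance$matched_id, groups)
```

With this shape, `result[["2024-512345-67-00"]]` gives all IDs recorded for that trial.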

Thank you!


rfhb · 2025-02-09T17:52:55Z

Thanks @machado-t. The data frame outSet may have the sought information; it is created as an intermediate here:

ctrdata/R/dbFindIdsUniqueTrials.R

Line 326 in aa96076

# find duplicates

and looks like this, with one row representing one trial:

This could be cleaned up (e.g., SPONSOR includes the protocol code) and returned; possibly after refactoring of dbFindIdsUniqueTrials() to pull out the function generating this concordance table.

Please, what are the use cases? Would such a tabular format be alright, or would a list such as above be needed, and if yes, how would the name of the list items be specified by the user?

machado-t · 2025-02-10T15:27:09Z

Thank you very much! I managed to get what I needed using outSet via the debug function, so please don't worry too much about implementing this.

Please, what are the use cases?

What I am trying to solve relates to an old database from multiple registries (ICTRP). I need to link it with a dataset collected with ctrdata, which is only possible if I know which trial IDs were merged.

Knowing which trial IDs were matched also enables other use cases. For example, CTGOV2 provides minimum and maximum age eligibility limits, while CTIS only offers age groups. Since different registries may have overlapping or unique data, having all IDs allows you to compare the sources or select the one that is best suited.

Would such a tabular format be alright, or would a list such as above be needed

A tabular format can be even more useful as it preserves information on the source registry.
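
To illustrate the linking use case with a tabular format, here is a minimal sketch using base R's `merge()`. Both data frames and all of their column names (`unique_id`, `matched_id`, `register`, `TrialID`, `ictrp_value`) are made up for this example; they are not the actual structure of outSet or of an ICTRP export.

```r
# Hypothetical concordance table in long format, keeping the source
# register of each matched ID (all names are assumptions).
concordance <- data.frame(
  unique_id  = c("2024-512345-67-00", "2024-512345-67-00", "NCT01234568"),
  matched_id = c("2024-512345-67-00", "NCT01234567", "NCT01234568"),
  register   = c("CTIS", "CTGOV2", "CTGOV2"),
  stringsAsFactors = FALSE
)

# Hypothetical older dataset that knows trials only by one registry ID each.
ictrp <- data.frame(
  TrialID     = c("NCT01234567", "NCT01234568"),
  ictrp_value = c("a", "b"),
  stringsAsFactors = FALSE
)

# Join on any matched ID, so that each old record picks up the unique ID
# under which the trial would be found after deduplication.
linked <- merge(ictrp, concordance, by.x = "TrialID", by.y = "matched_id")
```

Each row of `linked` then carries both the original ID and the unique ID, which can serve as the key for joining with data retrieved via ctrdata.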

and if yes, how would the name of the list items be specified by the user?

In case you implement a list instead, the names could follow the same logic as dbFindIdsUniqueTrials() currently does, using preferregister, if I understand your question correctly.

rfhb · 2025-02-10T18:41:45Z

Thanks for explaining and glad you were not blocked.

It is not yet clear how, in general, to extend ctrdata to ICTRP. I received no reply when I asked in mid-2023 for access (which seems to be called crawling there).

Perhaps there should be a dedicated function to generate such a concordance table of identifiers, e.g. dbGetConcordantIds(). This function would then be called by dbFindIdsUniqueTrials(). Overall, this could be efficient.
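
The split could look roughly like the following toy sketch. The functions `get_concordant_ids()` and `find_unique_ids()` are hypothetical stand-ins, not the ctrdata implementation: the real functions would read from the database, whereas here the concordance table is hard-coded, and its wide layout (one row per trial, one column per register) is an assumption.

```r
# Hypothetical stand-in for the concordance function. It returns a fixed
# table with one row per trial and one column per register; NA means the
# trial has no ID in that register.
get_concordant_ids <- function() {
  data.frame(
    CTIS   = c("2024-512345-67-00", NA, "2024-512346-78-00"),
    CTGOV2 = c("NCT01234567", "NCT01234568", NA),
    EUCTR  = c("2023-002345-67", NA, "2023-002356-89"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical caller: for each trial, keep the ID from the first register
# in `preferregister` that has one.
find_unique_ids <- function(preferregister = c("CTIS", "CTGOV2", "EUCTR")) {
  concordance <- get_concordant_ids()
  # Reorder the columns by preference, then take the first non-NA per row.
  ids <- concordance[, preferregister, drop = FALSE]
  unname(apply(ids, 1, function(row) unname(row[!is.na(row)][1])))
}

unique_ids <- find_unique_ids()
unique_ids_ctgov_first <- find_unique_ids(c("CTGOV2", "CTIS", "EUCTR"))
```

With this structure, users who need the matches can call the concordance function directly, while the deduplication function stays a thin layer on top of it.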

rfhb self-assigned this Feb 8, 2025

rfhb added enhancement under investigation labels Feb 8, 2025
