Best practice for properties with external controlled vocabularies #142
-
What is the best practice for properties that should be constrained based upon an external controlled vocabulary? For example, suppose we are defining a JSON Schema for a healthcare encounter and wish to constrain a procedureCode field to use terms from LOINC (http://loinc.org). For small controlled vocabularies, enum is the obvious choice. But enum seems inadequate when (a) the list of values is very large and/or (b) the list of values is controlled by another organization. I can think of a few alternatives to enum for this use case, but I'm curious if the community has identified best practice in this area:
Do any of the above options seem reasonable? Is there an alternative approach that the community considers to be best practice in this area?
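To make the limitation concrete, here is a minimal Python sketch (not from the thread) contrasting enum with a purely syntactic pattern check. The encounter-class values, the regular expression for the shape of a LOINC code, and the validate helper are all assumptions for illustration; the helper handles only the three keywords shown and is not a JSON Schema implementation.

```python
import re

# Hypothetical schema fragments; the enum values and the regex are illustrative assumptions.
SMALL_VOCAB_SCHEMA = {"type": "string", "enum": ["inpatient", "outpatient", "emergency"]}
# Assumed shape of a LOINC code: one to seven digits, a hyphen, one check digit.
LOINC_SHAPE_SCHEMA = {"type": "string", "pattern": r"^[0-9]{1,7}-[0-9]$"}


def validate(value, schema):
    """Check a value against the only keywords used here: type, enum, pattern."""
    if schema.get("type") == "string" and not isinstance(value, str):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    # As in JSON Schema, "pattern" only constrains string values.
    if "pattern" in schema and isinstance(value, str):
        if re.search(schema["pattern"], value) is None:
            return False
    return True
```

A pattern scales to a large external list but only checks shape: validate("0000000-0", LOINC_SHAPE_SCHEMA) passes whether or not such a code exists in LOINC, which is the gap the question is asking how to close.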
Replies: 4 comments 3 replies
-
This isn't valid in JSON Schema. Are you proposing that it be valid? What is on the other end of that URI? There is already wording in the specification about extra vocabularies, and a few implementations already support vocabulary extensions: https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.8.1
-
The "best practice" solution is to define your own vocabulary with a new keyword that does what you want.
The problem with this approach is that you might struggle to find tooling that supports vocabularies right now.
The main consideration is how the schema is going to be used. Will it be used only by you internally, or will it be shared externally?
If it's going to be shared externally, you should consider defining your own vocabulary. Similar work has been done previously to point to an ontology as a controlled vocabulary using a custom keyword.
Adding custom keywords without defining a vocabulary is possible, but then you have interoperability issues if someone using the schema doesn't get the memo about the additional keyword. (Unknown keywords in JSON Schema are ignored, so the validation using that keyword simply wouldn't happen.)
You should also consider that you might be introducing non-deterministic validation results.
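The interoperability point can be sketched in a few lines of Python. Everything here is hypothetical: the keyword name x-codeSystem, the in-memory KNOWN_CODES table standing in for an external terminology service, and the toy validate function, which is not a real JSON Schema implementation. The sketch only shows how a validator that has not been told about a keyword skips it silently.

```python
# Hypothetical custom keyword; not part of any JSON Schema draft.
CUSTOM_KEYWORD = "x-codeSystem"

# Stand-in for an external terminology service (sample entries only);
# a real deployment would query the service instead.
KNOWN_CODES = {"http://loinc.org": {"2345-7", "718-7"}}

SCHEMA = {"type": "string", CUSTOM_KEYWORD: "http://loinc.org"}


def check_type(value, expected):
    """Handle only the "string" type; other types are not constrained here."""
    return expected != "string" or isinstance(value, str)


def check_code_system(value, system_uri):
    """Accept the value only if the named code system lists it."""
    return value in KNOWN_CODES.get(system_uri, set())


def validate(value, schema, keyword_handlers):
    """Apply the keywords this validator knows and silently skip the rest."""
    for keyword, expected in schema.items():
        handler = keyword_handlers.get(keyword)
        if handler is not None and not handler(value, expected):
            return False
    return True


# A validator that never heard of the custom keyword, and one that has.
GENERIC = {"type": check_type}
EXTENDED = {"type": check_type, CUSTOM_KEYWORD: check_code_system}
```

With these definitions, validate("not-a-code", SCHEMA, GENERIC) passes while validate("not-a-code", SCHEMA, EXTENDED) fails, so two consumers of the same schema disagree. If KNOWN_CODES were a live service, the same instance could also pass at one time and fail at another as the external list changes, which is one way validation could become non-deterministic.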
-
Hi @mnizol and @Relequestual, very interesting discussion!
A variant of 1 is to make complete ontology term objects part of the schema. Some semi-standard data models here include FHIR CodeableConcepts:
The Human Cell Atlas developed their own extension to JSON Schema and built validators that use the EBI Ontology Lookup Service. See for example:
This schema is used for properties whose values must be terms from CCO or GO. They use an "ontology" object to represent an ontology term:
"ontology": {
    "description": "An ontology term identifier in the form prefix:accession.",
    "type": "string",
    "graph_restriction": {
        "ontologies" : ["obo:hcao", "obo:go"],
        "classes": ["GO:0007049", "GO:0022403"],
        "relations": ["rdfs:subClassOf"],
        "direct": false,
        "include_self": false
    },
    "user_friendly": "Cell cycle ontology ID",
    "example": "GO:0051321; GO:0000080"
},
You can see the "graph_restriction" (not part of JSON Schema, of course), which dynamically describes the set of allowed terms. This kind of pattern is sometimes known as an "intensional value set".
The Adaptive Immune Receptor Repertoire (AIRR) standard has a similar pattern for intensional value sets. Their schema is OpenAPI using the x-airr extension, but the same patterns could be applied to JSON Schema: https://github.com/airr-community/airr-standards/blob/master/specs/airr-schema-openapi3.yaml
This is an example from AIRR where the value of the species property must come from the "gnathostome" branch of the OBO ontology rendering of NCBITaxon (this seems like an obscure root node, but it has to do with conservation of key parts of the vertebrate immune system):
species:
    $ref: '#/Ontology'
    nullable: false
    description: Binomial designation of subject's species
    title: Organism
    example:
        id: NCBITAXON:9606
        label: Homo sapiens
    x-airr:
        miairr: essential
        adc-query-support: true
        set: 1
        subset: subject
        name: Organism
        format: ontology
        ontology:
            draft: false
            top_node:
                id: NCBITAXON:7776
                label: Gnathostomata
And in @Relequestual's own GA4GH schema:
There is a pattern to ensure the CURIE matches, but no checks on branches.
This is similar to the concept of "intensional value sets" found in clinical informatics (see for example the VSAC server: https://vsac.nlm.nih.gov/), which are formally encoded using FHIR: https://build.fhir.org/valueset.html#int-ext
UPDATE (2022-08-12): I posted about this particular problem here: https://douroucouli.wordpress.com/2022/07/15/using-ontologies-within-data-models-and-standards/, including a solution for LinkML, which can be compiled to JSON Schema.
UPDATE (2022-09-16): There was some previous work done on extending JSON Schema to support ontologies at the biohackathon, notes here
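The "graph_restriction" and "top_node" patterns in this comment both define the allowed terms by a rule over the ontology graph rather than by listing them. A rough Python sketch of evaluating such a rule follows. The toy PARENTS graph, the made-up EX: term IDs, and the reading of "direct" (parents only) and "include_self" (the listed classes themselves are allowed) are assumptions, not the Human Cell Atlas or AIRR implementations, which query a real ontology service.

```python
# Toy subclass graph (child -> parents); the term IDs are made up for illustration.
PARENTS = {
    "EX:0000002": ["EX:0000001"],
    "EX:0000003": ["EX:0000002"],
    "EX:0000004": ["EX:0000001"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def in_value_set(term, restriction, parents):
    """Evaluate a graph_restriction-like rule against the toy graph."""
    roots = set(restriction["classes"])
    # Assumed meaning of include_self: the listed classes themselves are allowed.
    if restriction.get("include_self", False) and term in roots:
        return True
    # Assumed meaning of direct: only immediate parents count, not all ancestors.
    if restriction.get("direct", False):
        candidates = set(parents.get(term, []))
    else:
        candidates = ancestors(term, parents)
    return bool(candidates & roots)


RESTRICTION = {"classes": ["EX:0000002"], "direct": False, "include_self": False}
```

Here in_value_set("EX:0000003", RESTRICTION, PARENTS) holds because the term sits under the listed class, while the listed class itself and the sibling branch "EX:0000004" are rejected. A CURIE pattern check alone would accept all three.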
-
Can't an enum be replaced with