-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect data dimensions when saving data by snirf_homer3 #4
Comments
@fangq, @dboas, Here's my thinking on this issue. Depending on the language and application the data is stored row or column-major:
The last option - "Neither" - means that the language does not store 2D (or >) arrays in contiguous memory. Only 1D arrays are contiguous and the other dimensions are references to other 1D arrays. Whichever type of language or application you use - row or column major storage, the HDF libraries will most likely compensate for a difference if there is one and so for instance you can save and load 2D arrays to and from an HDF5 file and get the right answer as long as you're using the same application or applications that agree in their storage order for multidimensional contiguous arrays. The problem happens as you says when the saver and loader of the file don't agree in their storage order. To me it's a bit arbitrary what scheme you use to fix these disagreements. Common sense might tell you that you should make sure to use row-major storage when saving because it agrees with HDF5 which is row major ... but there are other ways to address this interoperability issue besides what looks best. And in my opinion it is arbitrary which one you choose. Unless you have a formal rule you will get disagreements and interoperability confusion in the future around this issue. So I think what is needed is to add a formal specification about storage order to the SNIRF spec to take the ambiguity out of this issue. For instance you might have this:
or more generally:
|
@jayd1860, in a way, our specification defines the data structure in the hdf5 data structures, not in matlab array forms, nor C/python data strutures, thus, the hdf5 data constructs stored on disk need to meet our dimension/row/column requirements. if a snirf hdf5 file does not meet the spec, then it is not compliant. also, matlab's h5 api also suggest to transpose the data (and |
@jayd1860, let me do a more indepth test tomorrow. I anticipate one of our parsers will produce the correct dimensions (because snirf-homer3 transpose when reading and easyh5 does the transpose when saving), but for some reason, both outputs seem to have the incorrect dimensions when loading in python. I will need to look at how python's h5py load data. |
@fangq, I am doing this now in agreement with you that a column major language should convert to the file storage order (that is to row-major). However I am finding as perhaps you are as well that the Matlab rules to do this aren't as simple as transpose when saving and transpose again when loading. It looks a bit trickier than that. I think I found the rule and am testing it now. Will let you know if it works. Jay |
hi @jayd1860, I am about to post my findings - easyh5 also had similar issues, but I just fixed it, and the fix was easy, see my own ticket here https://github.com/fangq/easyh5/issues/10 and involved changes can be found here essentially, you call
before saving or after loading. Now the hdf5 data order and matlab element order agree to each other. If you don't want to manually permute, you can in theory write a flag in the array header telling hdf5 that the array order is in fortran/matlab, see https://support.hdfgroup.org/HDF5/doc1.8/H5.intro.html#Intro-PMCreateArray unfortunately, this flag can only be used in C library, and matlab's https://www.mathworks.com/help/matlab/ref/h5t.array_create.html In short, add a once you update, we can recreate the sample data files, and notify other developers that these files have changed the array layout. |
@jayd1860, I recreated the sample data files using my updated easyh5 toolbox in this set of files, the dataset dimensions stored in the hdf5 matches the specification (and also agrees the data loaded to matlab using easyh5 and python using h5py). Please feel free to test your parser using these new sample data. |
Hi @fangq, My SNIRF reader can't read these samples because it expects the data and aux to have index suffixes and it doesn't look like they are there. Whereas I see that stim does have the array subscripts. All 3 are "HDF5 group with one or multiple sub-groups (i.e. an indexed-group)" according to the spec. Below are excerpts first from current snirf-samples vs what snirf-homer3 generates. I think the snirf-samples .snirf files should have array subscripts appensded to the group name. ===============================
snirf_homer3:
|
@jayd1860, this is what we decided for the indexed group numbering:
I highlighted the condition "when multiplicity presents". I remember this originally was proposed by Ted when we migrated the language from matlab struct to HDF5 structures, and the rule was when there is a singular subfield, then, its name does not have to add 1 at the end, i.e.
let me know if this is what you have in mind. |
@fangq, @dboas. That's strange, I remember us deciding this only regarding the top level group /nirs, not any group. Also here's the whole line in question: "In the below sections, we use the notations "(i)" "(j)" or "(k)" inside the HDF5 location paths to denote the indices of sub-elements when multiplicity presents." It kind of sounds to me like it's not at all talking about having different naming conventions for indexed groups. Rather what it seems to be saying is that indexed groups (vs non-indexed such as probe) are referred to in the text using the above-mentioned index notation. I would say it's not completely clear, but that's my interpretation of it. Lastly it seems like it would be adding a needless exception to a simple rule and a needless extra step in the code to satisfy that exception. Why do that here? Maybe Ted can weigh in |
it kind if says both - indexed groups will be referred to later using these notations, and, the numbering shall be appended when multiple records are included. At least that I had in mind when I added this language in this commit: fNIRS/snirf@25bfac9#diff-ee33e2b4d0875712a7909355275c703aR79
no, instead, it is meant to be the only rule for all indexed arrays - instead of letting I like to hear what @huppertt thinks of this. On a side note, I have to say that appending a number in the group names is probably the most fragile feature in this whole spec. For example, it prevents us from using any group names that ends with a digit, and requires us to check for number continuity when reading&saving. We previously did not have this issue because it was defined in MATLAB struct, and matlab data support struct-arrays and cell-arrays. However, when we decided to store this in an HDF5 structure, we found that struct arrays or cells arrays do not really exist in HDF5, here, everything is named (does not support indexing). I think this is a limitation of HDF5, but right now, I don't see a solution except that the index can be ignored when there is a singular record. |
I thought we had agreed that all "array" fields including data and aux would have a numeric index. So /nirs/data1 is allowed but /nirs/data would not be. So all arrays (even if they had a single entry) still had to be indexed.
The exception to this was the /nirs field itself, which we said could be /nirs or /nirs1. The reason for this exception was that all data except hyperscanning would probably only have a single /nirs entry. Thus, in most cases this would not be an array.
An advantage of hdf5 is that you can add fields without having to rewrite existing ones. Eg you can add a /data2 without needing to rebuild the whole file. If you allow /data to not have a number, then the write code would need to change indices on existing fields to add.
With that said, I think my code handles the case of a missing index (eg /nirs/data) in the read code (because it seemed nicer to catch this and issue a warning but continue than to ab-end with an exception).
Ted
Theodore Huppert, PhD
University of Pittsburgh
Department of Electrical and Computer Engineering
Email: [email protected]
Phone: 1-412-647-8459
Website: www.huppertlab.net
…________________________________
From: Qianqian Fang <[email protected]>
Sent: Tuesday, May 19, 2020 7:39:39 PM
To: fNIRS/snirf_homer3 <[email protected]>
Cc: huppertt <[email protected]>; Mention <[email protected]>
Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)
It kind of sounds to me like it's not at all talking about having different naming conventions for indexed groups. Rather what it seems to be saying is that indexed groups (vs non-indexed such as probe) are referred to in the text using the above-mentioned index notation.
it kind if says both - indexed groups will be referred to later using these notations, and, the numbering shall be appended when multiple records are included. At least that I had in mind when I added this language in this commit: fNIRS/snirf@25bfac9#diff-ee33e2b4d0875712a7909355275c703aR79<fNIRS/snirf@25bfac9#diff-ee33e2b4d0875712a7909355275c703aR79>
I like to hear what @huppertt<https://github.com/huppertt> to say on this.
On a side note, I have to say that appending a number in the group names is probably the most fragile feature in this whole spec. For example, it prevents us from using any group names that ends with a digit, and requires us to check for number continuity when reading&saving. We previously did not have this issue because it was defined in MATLAB struct, and matlab data supports struct-arrays or cell-arrays. However, when we decided to store this in an HDF5 structure, we found that struct arrays or cells arrays do not really exist in HDF5, here, everything is named (does not support indexing). I think this is a limitation of HDF5, but right now, I don't see a solution except that the index can be ignored when there is a singular record.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#4 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ALNU3OSK7UF35T24EQSPBFDRSMKDXANCNFSM4M66BIGA>.
|
I don't mind explicitly adding language to require all indexed groups to have the index, but this will make two rules - if we do want to change this, we will also have to use some new notations, such as converting to
my code can also handle missing indices as well. I also wrote a dedicated function, |
What Ted is saying about "array fields including data and aux would have a numeric index" is what I remember us saying back then. |
So, do we have convergence on the solution?
Can someone summarize it?
From: jayd1860 <[email protected]>
Reply-To: fNIRS/snirf_homer3 <[email protected]>
Date: Wednesday, May 20, 2020 at 6:50 AM
To: fNIRS/snirf_homer3 <[email protected]>
Cc: "Boas, David" <[email protected]>, Mention <[email protected]>
Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)
What Ted is saying about "array fields including data and aux would have a numeric index" is what I remember us saying back then.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#4 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AHFCP5DUFEC7543ZINNX3FTRSOYWJANCNFSM4M66BIGA>.
|
in the current spec draft, we use a single rule to describe all indices in a group name - the group index can be ignored when there is a single dataset under this group, no matter it is I suppose what Ted and Jay remembered was that our previous decision was that all indexed group names must followed by an index, even there is only a single dateset beneath; however, I personally like the first (the current) naming approach, but am happy to change it to the 2nd if we all agree. The issue is that the notation will be a bit cumbersome because we will have to use separate notations for |
I am in favor of the "that all indexed group names must followed by an index, even there is only a single dateset beneath; however, /nirs is a exception - when there is a single measurement set, it can be written as /nirs instead of /nirs1" version. There is an ambiguity if you allow it both ways for many of the fields when the file (inappropriately) has both entries. In cases like the measurementList, the index needs to match the index of the data. Eg if there were a /measlist and /measlist1, which one describes the first entry of the data? The correct answer (following spec) is that this is never allowed but it would be up to the file reader to handle this as an error exception. At least the code ignoring and not loading unnumbered entries would throw obvious errors when the fields were missing, but the ambiguous issue could result in the data still reading but being wrong (eg nearly impossible for the end user to debug without digging in the snirf read function(
I don't think /nirs has this ambiguity, but if having this one exception causes concern, I would be ok with the "numbered even if there is only one entry" applying to the /nirs1 level too.
Theodore Huppert, PhD
University of Pittsburgh
Department of Electrical and Computer Engineering
Email: [email protected]
Phone: 1-412-647-8459
Website: www.huppertlab.net
…________________________________
From: Qianqian Fang <[email protected]>
Sent: Wednesday, May 20, 2020 9:18:14 AM
To: fNIRS/snirf_homer3 <[email protected]>
Cc: huppertt <[email protected]>; Mention <[email protected]>
Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)
So, do we have convergence on the solution? Can someone summarize it?
in the current spec draft, we use a single rule to describe all indices in a group name - the group index can be ignored when there is a single dataset under this group, no matter it is nirs or data or measurementList.
I suppose what Ted and Jay remembered was that our previous decision was that all indexed group names must followed by an index, even there is only a single dateset beneath; however, /nirs is a exception - when there is a single measurement set, it can be written as /nirs instead of /nirs1.
I personally like the first (the current) naming approach, but am happy to change it to the 2nd if we all agree. The issue is that the notation will be a bit cumbersome because we will have to use separate notations for /nirs and other group names.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#4 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ALNU3OWNZ3WEI5GLYWBKI7TRSPKBNANCNFSM4M66BIGA>.
|
Again agreeing with what Ted is saying on the naming convention. I don't mind having top level /nirs be the only exception unless someone objects to this. |
alright, I will regenerate the sample files and clarify the language. |
@jayd1860, sample files are regenerated, please test again.
|
hi @jayd1860 and @dboas, as more groups are working on adding supports to snirf (see mne-tools/mne-python#7057, brainstorm-tools/brainstorm3#283, fNIRS/snirf-samples#6) we need to make sure the sample files are consistent, not only across our own matlab parsers, but also parsers written with other languages.
@jayd1860, according to our previous conversation, I know you transpose data in the loading phase of a
.snirf
file instead of the saving phase. This is not an issue if one uses snirf_homer3 for both saving and loading, but can cause an issue for other parsers.I just did a quick test using python, and the snirf_homer3 generated sample data shows incorrect data dimensions, here are the script
first, installing python hdf5 support (and numpy) via
then start python, and load the
neuro_run01.snirf
file that I previously created using homer3's snirf parserthis is inconsistent with our specification, where the dimensions is defined as
#time points x #channels
.The transposed data is a result of directly saving a column-major matlab array to a row-majored HDF5 dataset.
I expect that @huppertt's C# parser will also experience similar transpose issues.
In comparison, my easyh5 toolbox transposes the data before saving to hdf5, so it is consistent if the file is opened with other programming languages.
Please let me know if you are able to fix this, and I will be happy to recreate the sample file in the snirf-samples repo so that others can test.
The text was updated successfully, but these errors were encountered: