Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Design questions for HDFStore.append #60920

Open
JakeTT404 opened this issue Feb 12, 2025 · 3 comments
Open

API: Design questions for HDFStore.append #60920

JakeTT404 opened this issue Feb 12, 2025 · 3 comments
Labels
API Design IO HDF5 read_hdf, HDFStore Needs Discussion Requires discussion from core team before further action

Comments

@JakeTT404
Copy link
Contributor

I'm new and am trying to put together a PR for documentation improvements surrounding HDFStore. I've been looking at HDFStore.append() and have encountered several instances where I am unsure of what the best practice is:

  • Many parameters are unused or have no effect by nature. Do we keep these and document them with an additional statement that they are unused, or do we remove them/un-document them?
  • Some parameters do have effects but their variable names are misleading such as min_itemsize that defines the max size of string columns and axes that change the axis of appending (though technically correct can be misleading). Do we keep them as is and document them as normal or modify the names throughout?
  • Some parameters have default values of None which are later overwritten by sub-function default values (e.g. nan_rep and chunksize). Do we back-propagate the default values into all sub-functions such that we can document correct default values?
  • Is there a point in having an append parameter when there exists both append() and put() which act as near identical wrappers for the same function _write_to_group()?

Any opinions or resources regarding this will be helpful.

@rhshadrach
Copy link
Member

rhshadrach commented Feb 12, 2025

Thanks for the report! I think what you're doing is raising the question of whether or not there are certain deficiencies that should be corrected or merely documented. That's the right approach, but pandas uses QST for questions from users about using pandas. I've reworked this as a API report.

Many parameters are unused or have no effect by nature.

Can you detail these?

Some parameters do have effects but their variable names are misleading such as min_itemsize

This goes back to

94bd4e7

I don't understand why the name min_itemsize was chosen; I am positive on changing this to max_itemsize.

axes that change the axis of appending (though technically correct can be misleading).

Can you detail why you find this misleading?

Some parameters have default values of None which are later overwritten by sub-function default values (e.g. nan_rep and chunksize).

For each parameter, what do you believe the default should be?

Is there a point in having an append parameter

I think this is a convenience to do e.g. df.to_hdf(..., mode="a", append=True). What would the alternative look like without an append parameter?

@rhshadrach rhshadrach changed the title QST: What modifications are expected/good practice for DOC contributions? API: Design questions for HDFStore.append Feb 12, 2025
@rhshadrach rhshadrach added API Design IO HDF5 read_hdf, HDFStore Needs Discussion Requires discussion from core team before further action labels Feb 12, 2025
@rhshadrach
Copy link
Member

Is there a point in having an append parameter

Ah, I confused df.to_hdf with HDFStore.append; I think you're talking about the latter. This is less clear to me, will need to dig into it more.

I think this also clears up my confusion on unused parameters. I haven't checked, but could the signature include unused parameters for a consistent signature across many functions, others of which do use the parameters? I have not checked.

@JakeTT404
Copy link
Contributor Author

I have put together the following notes regarding the parameters for the append function.

  • format for append can only be "table" as "fixed" does not support appending (remove?).
  • axes parameter must be an iterable containing a single value for the selection of a dataframe axis. It can only be two values [0] or [1] which change the axis used for indexing. from a user perspective, this only affects on which axis data is appended (document. rename possible but probably best not to).
  • index I believe writes the index column as a data column to allow on disk queries. See data_columns (document as such).
  • append is redundant (remove. refactor, see below).
  • complib and complevel both seem to function as intended (fix docs. tests?).
  • columns raises a TypeError immediatly saying to use data_columns instead (changed ages ago so just remove).
  • min_itemsize should be max (replace throughout).
  • nan_rep is default to "nan" downstream (add defaults to all functions using nan_rep).
  • chunksize is default to 100000 downstream (same as above).

There are 3 functions to write data to a hdf5 file, of which they all use the same underlying function. Below shows a rough diagram of the function calls for these methods (Note: only type "table"):

flowchart TB
    subgraph s2["HDFStore"]
        n3["append_to_multiple"] --> n1["append"]
        n2["put"] --> n4["_write_to_group"]
        n1 --> n4
    end
    n5["df.to_hdf"] --> n1 & n2
    n4 --> n6["write"]
    subgraph s1["AppendableTable"]
        n6 --> n7["_create_axes"] & n8["create_description"] & n9["write_data"]
    end
Loading

append, put and df.to_hdf all act as wrappers for the same function which is why they all have so many parameters. A good point is that due to this they can (mostly) share documentation (need to look into sharing docstrings). Best present course of action would likely be to convert them into pure wrappers (they each currently run their own checks so move these checks into _write_to_group) and then share their docstrings. In future, a full refactoring seems best since append and put merge way too early; ideally put would create the table then flow directly into append for writing. Also final thought is why put and not write?

Take these opinions with salt as I have little experience in large, community driven projects :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO HDF5 read_hdf, HDFStore Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

2 participants