Support string type inputs from Python application code. #2629
Conversation
Signed-off-by: Yasushi Negishi <[email protected]>
@gongsu832 This PR does not support JNI. Could you kindly give me comments on how to enable the JNI part, or create another PR to support string inputs with JNI?
Let's first decide how users would typically input a tensor of strings for the languages we support. For example, let's say a model specifies an input tensor of strings.
Regardless of how the string tensor is inputted in various languages, ultimately they will all be converted to the same form for the model runtime. Also, I would not change the meaning of the _owning flag.
@gongsu832 Thanks for your valuable comments.
Yes. Although this PR focuses on supporting Python, other languages should be considered as well.
For Python, "pybind11" converts multi-dimensional Python arrays into multi-dimensional std::vector types in C++ (e.g. inputPyArray.cast<std::vector<std::string>>() for a 1-D array).
The code copies the strings from the C++ vector types into a one-dimensional array with shape info, in order to (1) change the array style (from a C++ multi-dimensional vector to a C one-dimensional array) and (2) duplicate the string data buffers so that they are kept alive until "omTensorDestroy". This is only one copy per string. I agree that copies should be minimized, and I believe this copy in the PR is minimal and mandatory.
Yes, exactly. I suppose that there are no such cases now, but I am not sure, especially about the other language cases. Thank you for your useful comments. I am not sure my comments answer all of your questions, but more comments and suggestions are very welcome!
I think we should make the string tensor behave as similarly to a non-string tensor as possible. Currently, the C/C++ and Java APIs use the combination of a one-dimensional data array with a shape array, instead of a multi-dimensional array, for the input/output tensors. This has the advantage that the one-dimensional data array is tightly packed in contiguous memory and can therefore be passed into the model runtime directly. So, back to the string tensor: I think it should also use a combination of a one-dimensional string array plus a shape array for the input/output string tensors.
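As a concrete illustration of the representation suggested above (this is not code from the PR), a string tensor expressed as a flat data array plus a separate shape array might look like this in numpy:

```python
import numpy as np

# A 2x2 string tensor represented as the suggested combination of a
# tightly packed one-dimensional data array plus a separate shape array.
data = np.array(["abc", "d", "e", "fghi"], dtype=np.str_)  # 1-D string data
shape = np.array([2, 2], dtype=np.int64)                   # shape array

# The multi-dimensional view is always recoverable from the pair.
tensor = data.reshape(shape)
print(tensor[1, 1])  # fghi
```

Because the data array is one-dimensional and contiguous, it matches the layout the C/C++ and Java APIs already use for numeric tensors.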
Thank you for the comments!
I suppose that this PR does use a one-dimensional data array (of pointers to strings), not a multi-dimensional array. In this PR, string data are kept in the one-dimensional data array, and shapes are kept in the shape structure, the same as in the numerical data cases. The only difference is that the one-dimensional data array keeps pointers instead of the actual values, since the data lengths of the string elements in an array differ, unlike in the numerical data cases.
I suppose that this PR keeps the one-dimensional array structure. A string tensor behaves "almost identically" to an int64 tensor, as described in this comment.
I understand the idea, but the onnx-mlir runtime code needs to hold the string data (not just pointers) in the OMTensor structure until "omTensorDestroy" is called, because Python applications may free the string data used for the OMTensor of the first argument before generating the OMTensor for the next argument. Thank you very much for the comments, and I am sorry if my explanations were insufficient or I misunderstood your comments. Your comments are important and very welcome.
What I meant is that the Python interface, instead of using a multi-dimensional numpy array, should have used a one-dimensional array plus a shape array, like the other languages.
I'm not sure I understand this. Can you provide an example?
include/onnx-mlir/Runtime/OMTensor.h
 * @return pointer to OMTensor created, NULL if creation failed.
 *
 */
#define OMTENSOR_OWNING_DATA_PTR 1
#define OMTENSOR_OWNING_DATA_PTR_AND_STRING 2
Your change can support pointers to data types other than string, right?
The ownership level is essentially the pointer depth for data_ptr:
0: data_ptr
1: *data_ptr
2: *(*data_ptr+i)
For an input tensor of string type, I feel that its OMTensor owns *data_ptr like other types of tensors, but not **data_ptr, because those objects are allocated by Python. Is my assumption correct?
Is there any case in an ONNX operation where a totally new string will be generated? Or do we just copy the pointer of the string to the new tensor, similar to the way we copy integer or float data?
The _owning flag tells whether the generated model runtime owns the tensor data. It's not related to whether a language wrapper like JNI or pybind11 creates its own copy of the tensor data for whatever purpose before calling the runtime OMTensorList/OMTensor API. Those copies should be managed by the wrapper itself, not the model runtime. So typically, for input tensors the _owning flag should always be false, and for output tensors the _owning flag should always be true. The one exception, where the _owning flag is false for output tensors, is when the tensor data is static and cannot be freed.
This is the reason why I'm not sure why we may have the cases here where the pointers and string data may be owned by different parties and asked @negiyas for an example.
The current implementation stores a char * pointer in the tensor of strings. If a tensor of strings and the strings themselves are created inside the compiler, we will have to allocate the space for data_ptr and for the strings; that would be a new case for the ownership. If the tensor of strings just propagates pointers from one tensor to another, the ownership will be the same as for the float or integer data types.
You mean there can be multiple output string tensors sharing the same string data? How the strings are passed around internally doesn't really matter; what matters is how the strings are passed out of the model runtime in the output tensor(s).
@gongsu832 Thanks for the comments. I changed the code accordingly: it now uses the OMTensor->_allocated buffer for the string data themselves, and there are no longer any modifications to src/Runtime/OMTensor.inc relative to the original code.
The string data allocated by Python applications are copied into a memory area in the OMTensor->_allocated buffer, which follows the area for the array of string pointers (in the PyExecutionSessionBase::pyRun function in PyExecutionSessionBase.cpp). The _allocated buffer is freed in omTensorDestroy like in the other cases.
I hope that this update answers your concerns related to the OMTensor->_owning flag.
Please let me know if you have additional comments and suggestions.
(Let me explain my thoughts about the multi-dimensional array in Python in another comment.)
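As an illustration only (this is not the runtime code, and the helper name is hypothetical), the single-allocation layout described above — an offset/pointer table followed by the string bytes in one buffer — can be sketched in Python:

```python
import struct

def pack_strings(strings):
    """Hypothetical sketch of the layout described above: one allocation
    holding a table of 64-bit offsets, followed by the NUL-terminated
    string bytes. Freeing the single buffer frees everything, analogous
    to omTensorDestroy freeing the _allocated buffer."""
    encoded = [s.encode("utf-8") + b"\0" for s in strings]
    table_size = 8 * len(encoded)  # one 64-bit offset per string
    offsets, pos = [], table_size
    for e in encoded:
        offsets.append(pos)
        pos += len(e)
    buf = struct.pack(f"<{len(offsets)}q", *offsets) + b"".join(encoded)
    return buf, offsets

buf, offsets = pack_strings(["abc", "d", "e", "fghi"])
# Reading string 3 back out of the single buffer:
start = offsets[3]
end = buf.index(b"\0", start)
assert buf[start:end].decode() == "fghi"
```

The real implementation stores pointers rather than offsets, but the ownership consequence is the same: one allocation, one free.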
Signed-off-by: Yasushi Negishi <[email protected]>
Thanks. I have got the point now. My comments are as follows. (1) The current implementation uses multi-dimensional numpy arrays for input/output tensors, which is a standard array format in Python. This PR simply follows the current approach for the numerical data cases and does not change the policy, including the multi-dimensional array treatment.
I understand you are trying to follow the current way of using numpy multi-dimensional arrays for the Python wrapper. But that's exactly what I'm questioning: whether this is the best way of doing it. I think using a multi-dimensional numpy array actually makes things unnecessarily complicated, because you have to "flatten" it anyway.
…instead of buffers besides the _allocated buffer. Signed-off-by: Yasushi Negishi <[email protected]>
Signed-off-by: Yasushi Negishi <[email protected]>
Because "pybind11" uses the shape defined by the shape attribute of numpy, we cannot flatten multi-dimensional arrays in Python application code while keeping the original shape visible to pybind11.
Yes, exactly. As described in your comments, for the numerical data cases we need not flatten the multi-dimensional array manually. I tried to follow the same approach for the string data cases, but could not, because of a pybind11 issue (pybind/pybind11#1538; that page shows a solution, but it does not work for our case). I will update the code if we find a better way to avoid the issue or pybind11 fixes it. I added the following comment in the code to explain the situation.
The current interface takes a vector of inputs. This way you will be able to handle arbitrary dimensions without having to hardcode a maximum supported dimension (that part of the code is really ugly). Yes, this means that more test code will have to be changed: existing calls will need to be updated to the new form. But IMO this is the right way to do it.
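The exact snippets under discussion are not preserved in this thread. As a hypothetical sketch only (the `session.run` name and signatures are assumptions, not the actual API), the test-code change being suggested might look like:

```python
import numpy as np

# Hypothetical example input; "session.run" is an assumed API name.
a = np.array([["abc", "d"], ["e", "fghi"]], dtype=np.str_)

# Before: pass the multi-dimensional numpy array directly and let the
# binding read its shape.
# outputs = session.run([a])

# After: flatten in the Python wrapper and pass the shape (and strides)
# explicitly alongside the one-dimensional data.
flat = a.ravel()
shape = list(a.shape)
# outputs = session.run([flat], [shape], ...)
```

The point is that the flattened form plus an explicit shape list works for any rank, with no hardcoded maximum dimension.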
Signed-off-by: Yasushi Negishi <[email protected]>
@AlexandreEichenberger Thanks for your thorough comments.
Let me answer this bigger question first; I will answer the other questions next.
Thanks for the summary. It is correct.
The answers are yes and yes.
Yes. I agreed with @gongsu832 to have a class with a 3-argument run method for execution, and developed this PR based on your summary. I have now realized that we can preserve the external Python interfaces while still keeping the same internal ("pybind") interface with three arguments, and have updated the PR accordingly.
I hope that the latest PR answers most of the comments from @AlexandreEichenberger and @gongsu832.
I suppose that this PR has a negligible impact on performance. Although we need to pass two additional arguments (shapes and strides) through the "pybind" module, the arguments are quite small (1-D, with the input rank as its size, for each argument) and are passed only once per module execution.
Signed-off-by: Yasushi Negishi <[email protected]>
@jenkins-droid test this please
Signed-off-by: Yasushi Negishi <[email protected]>
LGTM, thanks for the changes. I am grateful that you found a way to not change the external interface while satisfying the needs of the 3-argument run method; please confirm that this is indeed the case.
I left two small nits and approved the PR, but I believe you should correct them: one is about the relative include path, and the other is about the copyright not being updated to 2024.
Thanks for the good work.
@@ -21,7 +21,7 @@
 namespace py = pybind11;

-#include "ExecutionSession.hpp"
+#include "../ExecutionSession.hpp"
Nit: I believe in most places we give the path from the onnx-mlir root dir. I would use that here if you don't mind.
Fixed. Thanks!
src/Runtime/python/PyRuntime.py

############# PyOMRuntime.py #######################################
#
# Copyright 2021-2023 The IBM Research Authors.
Nit, please update all of the copyright to include year 2024, here and elsewhere.
I updated "2023" to "2024" in the files included in this PR; other files containing "-2023" remain unchanged.
PS: I see the python format check failing for other PRs as well, so no need to worry about this. I asked @gongsu832 to look into why it reports a failure.
I think there might be some misunderstanding. The external interface will change to 3 arguments. The original 1-argument interface uses the shape info embedded in the numpy array, and the Python-to-native conversion is done by pybind11 directly. But it has trouble supporting the string data type, because the multi-dimensional array must be "flattened" to one dimension by copying each dimension manually in C/C++, since pybind11 currently does not directly support the string data type. With the 3-argument interface, the numpy array is "flattened" in the new Python wrapper (hence the need for the additional arguments to carry shape info), and this "flattening" is typically a simple "view change" without copying. The 3-argument interface is also more similar to the C/C++ and Java interfaces.
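The "view change without copying" point can be checked directly with numpy (a small demonstration, independent of onnx-mlir):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Flattening a C-contiguous array is a view change: no data is copied.
flat = a.ravel()
assert np.shares_memory(a, flat)

# Ravel of a non-contiguous slice, by contrast, must copy.
sliced = a[:, :2]
assert not np.shares_memory(a, sliced.ravel())

# The shape info travels separately, as in the 3-argument interface.
restored = flat.reshape(a.shape)
assert np.array_equal(restored, a)
```

So for the typical contiguous numpy input, the wrapper's flatten costs essentially nothing.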
The problem is that macOS is now using a newer version of black, 24.1.0. So on your local machine you should upgrade black to that version.
@jenkins-droid test this please
Signed-off-by: Yasushi Negishi <[email protected]>
@gongsu832 I updated the black command, and also updated the format of utils/analyze-simd.py and utils/pre-onnx-mlir.py to fix the black format errors. Thanks.
Yes, exactly. Thanks for pointing out this issue. I remember this issue again now :-).
Jenkins Linux amd64 Build #13930 [push] Support string type inpu... started at 22:09
Jenkins Linux s390x Build #13957 [push] Support string type inpu... started at 23:09
Jenkins Linux ppc64le Build #12954 [push] Support string type inpu... started at 23:18
Jenkins Linux s390x Build #13957 [push] Support string type inpu... passed after 1 hr 27 min
Jenkins Linux amd64 Build #13930 [push] Support string type inpu... passed after 1 hr 31 min
Jenkins Linux ppc64le Build #12954 [push] Support string type inpu... passed after 1 hr 51 min
After pulling in this change, I'm noticing that the PyRuntime files are coming out differently. As is, that would break existing client applications; their programs would need to be updated for the change.
Also, I haven't tried it yet, but I'm guessing the onnx-mlir examples at https://github.com/onnx/onnx-mlir/blob/main/docs/mnist_example/mnist-runPyRuntime.py#L4 are broken, because the example still has the old form.
If it breaks clients, then that should be a major version change (i.e., 0.5.0) for onnx-mlir. Is that desired?
For zDLC, after generating the ...
Sorry I didn't notice this when reviewing the PR. I think the change is due to the original ... @cjvolzka, keep in mind that the original ...
Actually, I was wrong on this. I missed that the .run() interface changed, so that wouldn't actually work to avoid breaking exploiters. See the comment at #2629 (comment).
This PR supports string type inputs from Python application code.
(e.g. numpy.array([["abc", "d"], ["e", "fghi"]], dtype=np.str_)). The basic strategies are as follows: convert the input Python array into C++ std::vector types via pybind11 (e.g. inputPyArray.cast<std::vector<std::string>>(); for a 1-D array).

(Additional comments)
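A minimal Python-side sketch of the described flow (illustrative only, mirroring what the binding does internally rather than reproducing the PR's C++ code):

```python
import numpy as np

# The example string input tensor from the description above.
a = np.array([["abc", "d"], ["e", "fghi"]], dtype=np.str_)

# Conceptually, the binding flattens the array to one dimension of
# strings and records the shape separately before handing the data
# to the runtime.
flat_strings = a.ravel().tolist()
shape = list(a.shape)

print(flat_strings)  # ['abc', 'd', 'e', 'fghi']
print(shape)         # [2, 2]
```

On the C++ side, the flat vector corresponds to the result of the pybind11 cast shown above.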
Comments on the basic strategy and testing policies are very welcome.