Currently, Pandas does not provide a native way to store and retrieve complex data types like NumPy arrays (e.g., embeddings) in formats such as CSV without converting them to strings. This results in a loss of structure and requires additional parsing during data retrieval.
Many machine learning practitioners store intermediate results, including embeddings and model-generated representations, in Pandas DataFrames. However, when saving such data using CSV, the complex data types are converted to string format, making them difficult to work with when reloading the DataFrame.
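The loss of structure is easy to reproduce with a minimal round trip (using an in-memory buffer here for illustration):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1], "embedding": [np.array([0.1, 0.2, 0.3])]})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_loaded = pd.read_csv(buf)

# The array cell comes back as its string repr ("[0.1 0.2 0.3]"),
# not as an ndarray
print(type(df_loaded["embedding"][0]))  # <class 'str'>
```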
Current Workarounds:
Pickle: While Pickle retains structure, it is Python-specific (not portable across languages) and has security concerns when loading untrusted files.
Parquet: Parquet supports complex structures better but may not be the default choice for many users who rely on CSV.
Manual Parsing: Users often need to reprocess stringified NumPy arrays back into their original format, which is inefficient and error-prone.
Feature Request:
Introduce an option in Pandas to serialize and deserialize complex data types when saving to and loading from CSV, possibly by:
Allowing automatic conversion of NumPy arrays to lists during CSV storage.
Providing a built-in method for reconstructing complex data types when reading CSV files.
Supporting a more intuitive way to store and load multi-dimensional data efficiently without requiring workarounds.
Feature Description
Modify to_csv() to Handle Complex Data Types
Introduce a new parameter (e.g., preserve_complex=True) in to_csv() that automatically converts NumPy arrays to lists before saving.
Pseudocode:
import json

import pandas as pd
import numpy as np

class EnhancedDataFrame(pd.DataFrame):
    def to_csv(self, filename, preserve_complex=False, **kwargs):
        df = self
        if preserve_complex:
            df = self.copy()
            for col in df.columns:
                if isinstance(df[col].iloc[0], (np.ndarray, list)):  # Check for complex types
                    # Convert arrays/lists to JSON strings
                    df[col] = df[col].apply(lambda x: json.dumps(np.asarray(x).tolist()))
        # Write the (possibly converted) copy, not the original frame
        pd.DataFrame.to_csv(df, filename, **kwargs)
If preserve_complex=True, NumPy arrays and lists are serialized into JSON format before saving.
The standard to_csv() functionality remains unchanged for other users.
Modify read_csv() to Restore Complex Data Types
Introduce a restore_complex=True parameter in read_csv() that automatically detects JSON-encoded lists and converts them back to NumPy arrays.
Pseudocode:
class EnhancedDataFrame(pd.DataFrame):
    @staticmethod
    def from_csv(filename, restore_complex=False, **kwargs):
        df = pd.read_csv(filename, **kwargs)
        if restore_complex:
            for col in df.columns:
                # Heuristic check for JSON-encoded lists
                if df[col].apply(lambda x: isinstance(x, str) and x.startswith("[")).all():
                    df[col] = df[col].apply(lambda x: np.array(json.loads(x)))  # Convert back to NumPy arrays
        return df
If restore_complex=True, JSON-encoded lists are automatically converted back to NumPy arrays when reading a CSV file.
Example Usage
df = pd.DataFrame({'id': [1, 2], 'embedding': [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]})

# Save with complex type handling
df.to_csv("data.csv", preserve_complex=True)

# Load and restore complex types
df_loaded = pd.read_csv("data.csv", restore_complex=True)
print(df_loaded["embedding"][0])  # Output: array([0.1, 0.2, 0.3])
Expected Benefits
✅ Users will be able to save and retrieve NumPy arrays, embeddings, or complex objects easily using CSV.
✅ Reduces the need for workarounds like Pickle or manual parsing.
✅ Keeps Pandas CSV handling more intuitive for machine learning workflows.
Alternative Solutions
Enhance documentation to recommend best practices for handling complex data types with Pandas and suggest an official approach for this use case.
Additional Context:
A common issue in ML workflows that involve embeddings, image vectors, and multi-dimensional numerical data.
Other libraries like PyArrow and Dask handle complex data better, but many users prefer Pandas for ease of use.
1. Using JSON Format Instead of CSV
Instead of saving complex data to a CSV file, users can save the DataFrame as a JSON file, which supports nested data structures.
Example
df.to_json("data.json", orient="records")
df_loaded = pd.read_json("data.json")
✅ Pros:
Natively supports lists and dictionaries without conversion.
Readable and widely supported format.
❌ Cons:
JSON files are not as efficient as CSV for large datasets.
JSON format is not always easy to work with in spreadsheet software.
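One further caveat: read_json brings nested values back as Python lists, not NumPy arrays, so a conversion step is still needed. A sketch of the round trip (using an in-memory buffer for illustration):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]})

buf = io.StringIO()
df.to_json(buf, orient="records")
buf.seek(0)
df_loaded = pd.read_json(buf)

# Nested values come back as lists; convert to ndarrays explicitly
df_loaded["embedding"] = df_loaded["embedding"].apply(np.array)
```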
2. Using Pickle for Serialization
Pandas provides built-in support for Pickle, which can store and retrieve complex objects.
Example
df.to_pickle("data.pkl")
df_loaded = pd.read_pickle("data.pkl")
✅ Pros:
Preserves complex data types natively.
Fast read/write operations.
❌ Cons:
Pickle files are not human-readable.
They are Python-specific, making them less portable for cross-platform use.
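A quick check of the Pickle round trip confirms that ndarray cells survive with their type intact (shown here with an in-memory buffer rather than a file):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"embedding": [np.array([0.1, 0.2, 0.3])]})

buf = io.BytesIO()
df.to_pickle(buf)
buf.seek(0)
df_loaded = pd.read_pickle(buf)

# The cell is still a real ndarray; no parsing required
print(type(df_loaded["embedding"][0]))  # <class 'numpy.ndarray'>
```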
3. Using Parquet for Efficient Storage
Parquet is a columnar storage format optimized for performance and supports complex data types.
Example
df.to_parquet("data.parquet")
df_loaded = pd.read_parquet("data.parquet")
✅ Pros:
Efficient storage with better compression.
Supports multi-dimensional data and preserves data types.
❌ Cons:
Requires pyarrow or fastparquet dependencies.
Not as universally used as CSV.
4. Manual Preprocessing for CSV Storage
Users often manually convert complex data to JSON strings before saving them to CSV.
Example
import json
df["embedding"] = df["embedding"].apply(lambda x: json.dumps(x.tolist()))
df.to_csv("data.csv", index=False)
df_loaded = pd.read_csv("data.csv")
df_loaded["embedding"] = df_loaded["embedding"].apply(lambda x: np.array(json.loads(x)))
✅ Pros:
Works with existing Pandas functionality.
CSV remains human-readable.
❌ Cons:
Requires manual preprocessing each time.
Error-prone and inefficient for large datasets.
Why the Proposed Feature is Needed
While these alternatives exist, they require either additional dependencies, manual preprocessing, or compromise on format usability. Adding native support for preserving and restoring complex data types in Pandas CSV operations would:
Eliminate the need for manual JSON encoding, Pickle files, or extra dependencies such as Parquet engines.
Improve usability for machine learning and data science workflows.
Keep CSV files human-readable while ensuring data integrity.
Note that you need to use the python engine rather than the c engine, as the latter does not (yet) support complex. I think adding support for complex to the c engine would be welcome.
Thank you for the clarification! I am interested in working on adding support for complex numbers to the C engine. Could you please guide me on how to get started or point me to any relevant developer resources? I look forward to contributing.