Replies: 7 comments
---
Hi @miriam-z, can you provide more context on what you're trying to achieve? Currently pandera doesn't support validating SQL tables.
---
@miriam-z I'm not very familiar with pycountry. To use custom data types, two approaches I can currently think of are:

Since you also want to generate synthetic data, I'd recommend approach (2). @jeffzi, let me know if you have other ideas. The new pandera type system doesn't currently support custom strategies, though we might want to look into how to support that. Once you have a minimal reproducible example in this thread, we can help you out with a more concrete code implementation.
To clarify, pandera allows users to specify the type of a dataframe; a schema is not itself a dataframe. Currently pandera only supports adding columns to the end with `add_columns`, though I'd be open to supporting insertion at other positions. As a workaround, you could do something like:

```python
original_schema = pa.DataFrameSchema({
    "col": pa.Column()
})

new_schema = pa.DataFrameSchema({
    "new_col_at_beginning": pa.Column(),
    **original_schema.columns,
    "new_col_at_end": pa.Column(),
})
```
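This workaround relies on Python dicts preserving insertion order (guaranteed since Python 3.7): unpacking `original_schema.columns` between the new entries controls where the columns land. A minimal sketch of the mechanism with plain dicts (the string values are placeholders for `pa.Column` objects):

```python
# Stand-in for original_schema.columns: an ordered mapping of column names
# to their definitions (here plain strings instead of pa.Column objects).
original_columns = {"col": "original column definition"}

# Dict unpacking preserves insertion order, so new keys land exactly where
# they appear relative to the unpacked mapping.
reordered = {
    "new_col_at_beginning": "added first",
    **original_columns,
    "new_col_at_end": "added last",
}

print(list(reordered))  # ['new_col_at_beginning', 'col', 'new_col_at_end']
```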
---
The first argument of `pa.Column` is the data type. Here is an example where I check against currency codes listed in pycountry:

```python
import pandera as pa
import pycountry

schema = pa.DataFrameSchema(
    {
        "USD Currency": pa.Column(
            str,
            checks=pa.checks.Check.eq(
                pycountry.currencies.lookup("usd").alpha_3
            ),
            required=True,
        ),
        "Local Currency": pa.Column(
            str,
            checks=pa.checks.Check.isin(
                [currency.alpha_3 for currency in list(pycountry.currencies)]
            ),
            required=True,
        ),
    }
)

df = schema.example(size=3)  # generate fake data
schema.validate(df)          # verify fake data is valid
#>   USD Currency Local Currency
#> 0          USD            SYP
#> 1          USD            ARS
#> 2          USD            XOF
```

Pandera can only generate valid examples. You'd need to modify the valid dataframe or create a second schema for failing examples.

@cosmicBboy re:
In my view, pandera Data Types check the native data type (so far native = pandas/numpy dtypes). They are not meant to create URL, Country, etc. data types. For that purpose, I think users can create generic functions that return a pre-built Column, e.g.:

```python
def CurrencyCode(**kwargs) -> pa.Column:
    # would need to reject data type in kwargs + append extra checks
    return pa.Column(
        str,
        checks=pa.checks.Check.isin(
            [currency.alpha_3 for currency in list(pycountry.currencies)]
        ),
        **kwargs,
    )
```

Simply call `CurrencyCode()` instead of `pa.Column` in the schema definition. We could have pre-defined shortcuts for common columns (URL, Path, etc.) or explain that trick in a cookbook. Let me know what you think.
---
@cosmicBboy and @jeffzi Thank you, I added a sample of the CSV to the first question. `pycountry.countries.lookup(s).alpha_2` returns `US`, i.e. these are basic strings. But I'm getting:

```
Exception has occurred: IndexError
During handling of the above exception, another exception occurred:
```

I'm also trying to understand `nullable` and `required` in the context of a Boolean, which is true or false: does a Boolean column take `nullable` and `required` just like a String/Int type? Thanks @jeffzi, I changed the code to have the type as the first argument, but I'm not sure where "tuple index out of range" came from. So a lambda is not correct then?

EDIT: fixed the input validation now.

So we validate the input CSV, and then once transformed, we need to validate against the transformed schema again; is that normal best practice?
---
That's indeed a bug. Your check was raising an `IndexError` instead of returning a result. Keep in mind that a check must return a boolean or a Series of booleans. Moreover, by default the input will be a Series, whereas `pycountry.countries.lookup(s)` is expecting a string. I'd recommend reading the checks documentation for a full explanation and examples.

Yes, typically you would wrap your transformations in functions and validate the input and output of those transformations. That ensures data is valid at each step of your pipeline. Decorators are available to make that process smoother.
---
Checks are vectorized by default, meaning that the check function receives a whole Series, so this won't work:

```python
pa.Check(lambda s: pycountry.countries.lookup(s).alpha_2)
```

So you can either map over it:

```python
pa.Check(lambda s: s.map(lambda x: pycountry.countries.lookup(x).alpha_2))
```

Or use `element_wise=True`:

```python
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2, element_wise=True)
```

Though this check as-is doesn't return a boolean value; something like this would be a valid check:

```python
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2 == "US", element_wise=True)
# or
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2 in {"US", "FR", ...}, element_wise=True)
```
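The vectorized vs. element-wise distinction can be sketched without pycountry; the dictionary below is a stand-in for `pycountry.countries.lookup(name).alpha_2`:

```python
import pandas as pd

# Stand-in lookup table for pycountry.countries.lookup(name).alpha_2.
ALPHA_2 = {"United States": "US", "France": "FR"}

def lookup_alpha_2(name: str) -> str:
    return ALPHA_2[name]

s = pd.Series(["United States", "France"])

# Vectorized check body: receives the whole Series, must return booleans.
vectorized = s.map(lookup_alpha_2) == "US"

# Element-wise check body: receives one value at a time.
element_wise = [lookup_alpha_2(x) == "US" for x in s]

print(vectorized.tolist())  # [True, False]
print(element_wise)         # [True, False]
```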
---
This issue should be fixed by #613; will convert this into a discussion.
---
Question about pycountry usage inside pandera
Sorry, I need to change the question, but I'm not sure if here or in pycountry is more applicable. Just wanting to ask if we can validate a Column using pycountry:

pycountry/pycountry#82
Second question:
How to generate test fake data to pass or fail the schema?
Third question:
How to convert schema into Dataframe and append additional Columns to either side of the original Dataframe?
train.csv:

```
DealName,D1,D2,D3,D4,D5,IsActive,Country,Currency,Company
Deal123,1223456789123445677.88888888,1223456789123445677.88888888,123456789123445677.888888,123456789123445677.888888,123456789123445677.888888,false,US,USD,FakeCompany
```