Replies: 7 comments
---
Hi @miriam-z, can you provide more context on what you're trying to achieve? Currently pandera doesn't support validating SQL tables.
---
@miriam-z I'm not very familiar with pycountry. To use custom data types, two approaches I can currently think of are:

Since you also want to generate synthetic data, I'd recommend approach (2). @jeffzi, let me know if you have other ideas. The new pandera type system doesn't currently support custom strategies, though we might want to look into how to support that. Once you have a minimal reproducible example in this thread, we can help you out with a more concrete code implementation.
To clarify, pandera allows users to specify the type of a dataframe; a schema is not itself a dataframe. Currently pandera only supports adding columns to the end with `add_columns`, though I'd be open to supporting insertion at other positions. As a workaround, you could do something like:

```python
original_schema = pa.DataFrameSchema({
    "col": pa.Column()
})

new_schema = pa.DataFrameSchema({
    "new_col_at_beginning": pa.Column(),
    **original_schema.columns,
    "new_col_at_end": pa.Column(),
})
```
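This workaround relies on Python dicts preserving insertion order (guaranteed since Python 3.7): unpacking `original_schema.columns` between the new entries controls where the columns land. A minimal sketch of the mechanism with plain dicts (the string values are placeholders for `pa.Column` objects):

```python
# Stand-in for original_schema.columns: an ordered mapping of column names
# to their definitions (here plain strings instead of pa.Column objects).
original_columns = {"col": "original column definition"}

# Dict unpacking preserves insertion order, so new keys land exactly where
# they appear relative to the unpacked mapping.
reordered = {
    "new_col_at_beginning": "added first",
    **original_columns,
    "new_col_at_end": "added last",
}

print(list(reordered))  # ['new_col_at_beginning', 'col', 'new_col_at_end']
```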
---
The first argument of `pa.Column` is the data type. Here is an example where I check against currency codes listed in pycountry:

```python
import pandera as pa
import pycountry

schema = pa.DataFrameSchema(
    {
        "USD Currency": pa.Column(
            str,
            checks=pa.checks.Check.eq(
                pycountry.currencies.lookup("usd").alpha_3
            ),
            required=True,
        ),
        "Local Currency": pa.Column(
            str,
            checks=pa.checks.Check.isin(
                [currency.alpha_3 for currency in list(pycountry.currencies)]
            ),
            required=True,
        ),
    }
)

df = schema.example(size=3)  # generate fake data
schema.validate(df)          # verify fake data is valid
#>   USD Currency Local Currency
#> 0          USD            SYP
#> 1          USD            ARS
#> 2          USD            XOF
```

Pandera can only generate valid examples. You'd need to modify the valid dataframe or create a second schema for failing examples.

@cosmicBboy re:
In my view, pandera Data Types check the native data type (so far native = pandas/numpy dtypes). They are not meant to create URL, Country, etc. data types. For that purpose, I think users can create generic functions that return a pre-built Column, e.g.:

```python
def CurrencyCode(**kwargs) -> pa.Column:
    # would need to reject data type in kwargs + append extra checks
    return pa.Column(
        str,
        checks=pa.checks.Check.isin(
            [currency.alpha_3 for currency in list(pycountry.currencies)]
        ),
        **kwargs,
    )
```

Simply call `CurrencyCode()` instead of `pa.Column` in the schema definition. We could have pre-defined shortcuts for common columns (URL, Path, etc.) or explain that trick in a cookbook. Let me know what you think.
---
@cosmicBboy and @jeffzi Thank you, I added a sample of the CSV to the first question. `pycountry.countries.lookup(s).alpha_2` returns `US`, i.e. these are basic strings. But I'm getting:

```
Exception has occurred: IndexError
During handling of the above exception, another exception occurred:
```

I'm also trying to understand `nullable` and `required` in the context of a Boolean, which is true or false: does a Boolean column take `nullable` and `required` just like a String/Int type? Thanks @jeffzi, I changed the code to have the type as the first argument, but I'm not sure where "tuple index out of range" came from. So a lambda is not correct then?

EDIT: fixed the input validation now.

So we validate the input CSV, and then once transformed, we need to validate against the transformed schema again; is that normal best practice?
---
That's indeed a bug. Your check was raising an `IndexError` instead of returning a result. Keep in mind that a check must return a boolean or a Series of booleans. Moreover, by default the input will be a Series, whereas `pycountry.countries.lookup(s)` is expecting a string. I'd recommend reading the checks documentation for a full explanation and examples.

Yes, typically you would wrap your transformations in functions and validate the input and output of those transformations. That ensures data is valid at each step of your pipeline. Decorators are available to make that process smoother.
---
Checks are vectorized by default, meaning that the check function receives a whole Series, so this won't work:

```python
pa.Check(lambda s: pycountry.countries.lookup(s).alpha_2)
```

So you can either map over it:

```python
pa.Check(lambda s: s.map(lambda x: pycountry.countries.lookup(x).alpha_2))
```

Or use `element_wise=True`:

```python
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2, element_wise=True)
```

Though this check as-is doesn't return a boolean value; something like this would be a valid check:

```python
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2 == "US", element_wise=True)
# or
pa.Check(lambda x: pycountry.countries.lookup(x).alpha_2 in {"US", "FR", ...}, element_wise=True)
```
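The vectorized vs. element-wise distinction can be sketched without pycountry; the dictionary below is a stand-in for `pycountry.countries.lookup(name).alpha_2`:

```python
import pandas as pd

# Stand-in lookup table for pycountry.countries.lookup(name).alpha_2.
ALPHA_2 = {"United States": "US", "France": "FR"}

def lookup_alpha_2(name: str) -> str:
    return ALPHA_2[name]

s = pd.Series(["United States", "France"])

# Vectorized check body: receives the whole Series, must return booleans.
vectorized = s.map(lookup_alpha_2) == "US"

# Element-wise check body: receives one value at a time.
element_wise = [lookup_alpha_2(x) == "US" for x in s]

print(vectorized.tolist())  # [True, False]
print(element_wise)         # [True, False]
```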
---
This issue should be fixed by #613; will convert this into a discussion.
---
Question about pycountry usage inside pandera
Sorry, I need to change the question, but I'm not sure if here or in pycountry is more applicable. Just wanting to ask if we can validate a Column using pycountry:

pycountry/pycountry#82
Second question:
How to generate test fake data to pass or fail the schema?
Third question:
How to convert schema into Dataframe and append additional Columns to either side of the original Dataframe?
train.csv:

```
DealName,D1,D2,D3,D4,D5,IsActive,Country,Currency,Company
Deal123,1223456789123445677.88888888,1223456789123445677.88888888,123456789123445677.888888,123456789123445677.888888,123456789123445677.888888,false,US,USD,FakeCompany
```