Tables, Transforms and Pipelines #1323

snth · 2022-12-22T11:06:34Z

snth
Dec 22, 2022
Maintainer

Recent developments on #523 reminded me that I've long wanted to have a discussion around clarifying terms and concepts around tables, transforms and pipelines and how there relate to CTEs and (table valued) functions.

I have developed a framework for myself around this and will be using the terms defined as follows (which doesn't correspond 100% to how we have been using these terms in discussions so far, particularly the term pipeline):

table: aka DataFrame, a set of tuples/rows
transform: transforms take a table as input and produce a new table as output
pipeline: a sequence of pipelines and transforms which produce a table

Transforms are made up of "atomic" transforms which correspond well to the transforms listed in section 3 of the book:

They mostly have a signature of transform: table -> table with one notable exception, namely from which is actually a pipeline (the atomic pipeline) and has signature from: null -> table.

So in my terminology we then have the following:

table: a struct type, with optional restrictions of columns that have to exist with given types, e.g.
employees<table[employee_id:int, name:str]>
salaries<table[employee_id:int, salary:float]>
transform: a function type with signature <table -> table>, e.gi
take<int table -> table>
pipeline: a function type with signature <null -> table>, e.g.
from<str null -> table> (We could argue about the null here. I initially didn't include the str in my signature. I just want to make it explicit that it doesn't act on a previous pipeline.)

Now one of the things that sets PRQL apart is the composability its transforms, with transforms being composed simply with \n or |. As of recent PR #1195, we can also put "pipelines" in lots of places which further increases PRQL's capabilities here.

Being able to produce custom (and reusable) transforms in PRQL could really be one of the key features for the language. However in practice I think we're not quite there yet, and in my analysis this comes down to how "namespaces" are handled in SQL. I will try to illustrate this with some examples. I'll largely make up some syntax.

pipeline top_5_salaries = (
  from s=salaries
  sort [-s.salary]
  take 5
)

It would be great to be able to generalise this, so let's say we could parameterise the pipeline definition in which case it actually becomes a transform

transform take_top_salaries n<int> s<table[salary:float]> = (
  sort [-s.salary]
  take n
)

This looks reasonable and we might expect the following to work:

from salaries
take_top_salaries 5

Because we know the previous pipeline always gets passed to the transform as the last parameter, we should be able to make the identification s=salaries.

However what happens when we pass a pipeline with multiple source tables, e.g.

from employees
join salaries [==employee_id]
take_top_salaries 5

? It would not be clear which source table/namespace to take the salary column from. Maybe this is not an issue as long as the source pipeline fulfills the type restriction table[salary:float] of having a uniquely named column salary of type float.

Dataframe libraries like Pandas, which have a logical model closer to PRQL, get around this by forcing unique column names at join time, using automatic renaming schemes like *lsuffix* and *rsuffix* (see e.g. pandas.DataFrame.join). SQL is more lenient and let's all columns exist in tablename. prefixed namespaces.

This is as far as I've gotten in my explorations. I would welcome any thoughts, comments and hopefully discussion as it's something I've been trying to express for a while but only recently been able to write down in (hopefully) coherent form.

snth · 2022-12-22T11:41:10Z

snth
Dec 22, 2022
Maintainer Author

As an aside, to me the whole thing smells of Monads but I'm too unfamiliar/rusty to really be able to express it properly so please read the following in that context.

My guesses are that:

from is the unit/return that takes a table name (object) and materialises it in a database Monad.
PRQL transforms are functions
The compiled SQL are monadic actions acting in the database monad.

Not sure if there is anything to be gained by making this identification.

1 reply

aljazerzen Dec 22, 2022
Maintainer

I cannot comment on that, I haven't delved too deep into Haskell / functional languages.

aljazerzen · 2022-12-22T11:45:23Z

aljazerzen
Dec 22, 2022
Maintainer

A small correction on terminology, which I realized only recently:

In context of databases, table usually means "persistently stored relation", while relation is the ordered set of tuples.

3 replies

snth Dec 22, 2022
Maintainer Author

Right, so a view would also be a relation and the difference to a table is that the table is persistently stored?

aljazerzen Dec 22, 2022
Maintainer

Yes, exactly.

max-sixty Dec 22, 2022
Maintainer

Ah OK great, I'll start using this too

aljazerzen · 2022-12-22T12:05:14Z

aljazerzen
Dec 22, 2022
Maintainer

I think of the whole thing very similarly, except for this difference:

- pipeline: a sequence of pipelines and transforms which produce a table
+ pipeline: a sequence transforms which produce a relation

Relation (or table if I use your terminology) can then be either:

a table, i.e. a source of data, or
a pipeline

And the line between relation and transform is kinda blurry.

expression	type	name
`filter a > 3`	`relation -> relation`	transform
`filter`	`column<bool> relation -> relation`	also transform?
`from`	`ident -> relation`	also transform?
`from my_table`	`relation`	relation (a single transfrom)
`from my_table \| filter a > 3`	`relation`	relation (two transfroms)

The difference in name may be only in whether you are asking "what is the value of this expression" as opposed to "what is the expression composed of"...

11 replies

snth Dec 22, 2022
Maintainer Author

Not necessarily, in my mind I just have them sitting in slightly different categories.

The way I think about the relation/table is as a concrete table definition in a database. As per rslabbert's comment in #381 I pictured us connecting to a database, downloading the schema definitions and storing these in "declaration files", I guess kinda like DDL, perhaps both table and view being subtypes of relation. You could store such declaration files in different directories for different databases, say prod, uat and dev. A pipeline would be a PRQL query which runs against one of these directories/environments and then evaluates to a resultant relation type, perhaps writing out a declaration file of it's own. A pipeline with signature () -> T on the other hand would be DQL. I don't know if the distinction is important from the compiler's perspective, to me they just seem like separate concepts.

So pipelines and transforms live in the PRQL Query space/category and then the PRQL Compiler is a Functor that transforms them to SQL in the Database space/category. relations are like table definitions that define the shape of data, and which "live" at the interface of consecutive transforms.

Through type inference you could hopefully figure out what schema constraints your source relations have to satisfy for a particular transform or pipeline so that you could easily check this against a particular relation declaration file to figure out if it's compatible or not.

aljazerzen Dec 22, 2022
Maintainer

Well, I see most of the things you described exactly the same. With the extension that PRQL query (RQ) is also a relation - just a computed one.

max-sixty Dec 29, 2022
Maintainer

I think this is exactly how I think about it.

I'm not sure exactly where Pipeline fits here — it seems that Relation and Transform is fairly complete. But we are somewhat bound to the Pipeline term given the PRQL name! And it does intuitively fill a useful space where "Relation followed by a series of Transforms" isn't welcome.

So maybe a Pipeline means a series of transforms, possibly bound to a relation?

snth Jan 10, 2023
Maintainer Author

So maybe a Pipeline means a series of transforms, possibly bound to a relation?

That's how I think about the term pipeline and how I have been using it above. However in the book currently pipeline seems to mean (possibly multi-step) transform (eg as an argument to group) while table is used to refer to relations (eg as an argument to join, from, and concat and union).

I'm happy to adopt the terminology of the book but I just think we need to pick one convention, spell it out explicitly somewhere and then use it consistently because the concepts are not interchangeable, i.e. join and group take arguments of different types and we should have names for them.

Given that the P in PRQL stands for pipeline, I think the pipeline term should be given a prominent role though.

aljazerzen Jan 10, 2023
Maintainer

I think of pipeline more like just a syntactic construct - like binary operation or function call. I'd also call this a pipeline:

(12 | div 5 | floor | mul 5)

... it's just that it doesn't evaluate to a relation, but to a number.

So:

group takes a function (usually expressed as a pipeline),
from and join take a relation (possibly expressed as a pipeline),
concat takes a relation (possibly expressed as a pipeline).

I'd say that the name of the language implies that it uses this construct as a core feature.

aljazerzen · 2022-12-22T14:36:32Z

aljazerzen
Dec 22, 2022
Maintainer

Regardless of the type of a variable (be it table, relation, function, transform or pipeline), I think we can change syntax of variable declaration to use let and an optional type annotation:

-table a = (...)
-pipeline a = (...)
+let a <relation> = (...)

5 replies

snth Dec 22, 2022
Maintainer Author

And what about function definitions?

aljazerzen Dec 22, 2022
Maintainer

func my_func a b -> (...)

... can stay as a syntactic sugar for:

let my_func = a b -> (...)

... or something similar, we can still decide on syntax for lambda functions (and if we even need them).

snth Dec 22, 2022
Maintainer Author

I guess I don't really understand what advantage it confers and I worry that the term let will become quite overloaded.

At the moment we have this nice property that you can mostly scan down the left hand column and get a good idea of what happens on that line, e.g.

func a ...
func b ...
table c ...

from d
filter ...

With let you would have to look at the type annotation or the value definition to figure out what the constant/value is.

It might be a worthwhile price to pay but I don't understand the benefit yet.

aljazerzen Dec 22, 2022
Maintainer

Having let also tells you what's happening in this line: a new thing gets defined and can later be used.

C-style languages, did use the type name as start of var def, but modern languages are switching to have let / const / var / auto, because of things like this:

String a = ...;
MyClassDelagator b = ...;
int c = ...;

as opposed to:

let a: String = ...;
let MyClassDelagator = ...;
let c: int = ...;

aljazerzen Dec 22, 2022
Maintainer

And I don't know about anything else using let ATM.

aljazerzen · 2022-12-22T14:46:41Z

aljazerzen
Dec 22, 2022
Maintainer

Regarding custom transforms: this is implemented but not documented:

func take_top_salaries n rel -> (
  # first expr in the pipeline is the value
  # in our case, the input relation 
  rel
  # then any transforms we want
  # note that we can refer to column in the relation
  # ... if they exist of course
  sort [-s.salary]
  take n
)

# this will work
from s=salaries
take_top_salaries 5

# but this will not: no relation s
from salaries
take_top_salaries 5

7 replies

snth Dec 22, 2022
Maintainer Author

Or another example

from equity
derive [salary = stock_options]
take_top_salaries 5

Could that work? Could we make it work? Would we want to make it work?

I guess the discussion I'm trying to foster is how do we structure PRQL so that we can make the most of the composability so that we can reap the same benefits of functional composability and reusable code as in other languages?

aljazerzen Dec 22, 2022
Maintainer

So would it be possible to change it to: sort [-rel.salary]

No, this wouldn't be possible.

I'm trying to say that rel.salary or s.salary is looked up - as is, using rel.salary as the column name - in the second argument of sort. In case of:

func take_top_salaries n rel -> (
  sort [-rel.salary]
  take n
)
# or equivalently: 
func take_top_salaries n rel -> (
  take n (sort [-rel.salary])
)

... there is no second argument! So this would error out, saying that sort is missing an arg.

But if we do like this:

func take_top_salaries n rel -> (
  take n (sort [-rel.salary] rel)
)

... we'd be looking up rel.salary in rel. In cases you provided, no relation had a column rel.salary, so it would be also produce an error.

snth Dec 22, 2022
Maintainer Author

Btw, the fact that the top example works at all is amazing! 👏 I think it's an incredible feature for PRQL to have and something we should definitely document and highlight as it really sets it apart.

I don't want my somewhat theoretical discussions of what could be possible in theory (or in an ideal language) to detract from that in any way. I've just been trying to figure out for myself to what extent one could start writing reusable functions and packages in PRQL and where the barriers might be.

aljazerzen Dec 22, 2022
Maintainer

I enjoy this theoretical discussion a lot 😄 It brings up corner cases that probably only I know of. Which means that their design may be very biased and in need of a second opinion.

So do try to compose more examples so we can discuss if they make sense in my interpretation and if the compiler should be adjusted.

snth Jan 23, 2023
Maintainer Author

Just linking to #1335 as there is a lot of discussion about namespaces there.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tables, Transforms and Pipelines #1323

{{title}}

Replies: 5 comments 28 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Tables, Transforms and Pipelines #1323

snth Dec 22, 2022 Maintainer

Replies: 5 comments · 28 replies

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

max-sixty Dec 22, 2022 Maintainer

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

max-sixty Dec 29, 2022 Maintainer

snth Jan 10, 2023 Maintainer Author

aljazerzen Jan 10, 2023 Maintainer

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

aljazerzen Dec 22, 2022 Maintainer

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

snth Dec 22, 2022 Maintainer Author

aljazerzen Dec 22, 2022 Maintainer

snth Jan 23, 2023 Maintainer Author

snth
Dec 22, 2022
Maintainer

Replies: 5 comments 28 replies

snth
Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

aljazerzen
Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

max-sixty Dec 22, 2022
Maintainer

aljazerzen
Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

max-sixty Dec 29, 2022
Maintainer

snth Jan 10, 2023
Maintainer Author

aljazerzen Jan 10, 2023
Maintainer

aljazerzen
Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

aljazerzen Dec 22, 2022
Maintainer

aljazerzen
Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

snth Dec 22, 2022
Maintainer Author

aljazerzen Dec 22, 2022
Maintainer

snth Jan 23, 2023
Maintainer Author