Replies: 5 comments 28 replies
-
As an aside, to me the whole thing smells of Monads, but I'm too unfamiliar/rusty to really be able to express it properly, so please read the following in that context. My guesses are that:
Not sure if there is anything to be gained by making this identification.
-
A small correction on terminology, which I realized only recently: in the context of databases, *table* usually means "persistently stored relation", while a *relation* is a set of (ordered) tuples.
-
I think of the whole thing very similarly, except for this difference:

```diff
- pipeline: a sequence of pipelines and transforms which produce a table
+ pipeline: a sequence of transforms which produce a relation
```

A relation (or table, if I use your terminology) can then be either:

And the line between relation and transform is kinda blurry. The difference in name may only be in whether you are asking "what is the value of this expression?" as opposed to "what is the expression composed of?"...
-
Regardless of the type of a variable (be it table, relation, function, transform or pipeline), I think we can change the syntax of variable declaration to use `let` and an optional type annotation:

```diff
-table a = (...)
-pipeline a = (...)
+let a <relation> = (...)
```
-
Regarding custom transforms: this is implemented but not documented:
-
Recent developments on #523 reminded me that I've long wanted to have a discussion around clarifying terms and concepts around tables, transforms and pipelines, and how they relate to CTEs and (table-valued) functions.

I have developed a framework for myself around this and will be using the terms defined as follows (which don't correspond 100% to how we have been using these terms in discussions so far, particularly the term *pipeline*):
Transforms are made up of "atomic" transforms which correspond well to the transforms listed in section 3 of the book:
They mostly have a signature of `transform: table -> table`, with one notable exception, namely `from`, which is actually a pipeline (the atomic pipeline) and has signature `from: null -> table`.

So in my terminology we then have the following:
- tables: `employees<table[employee_id:int, name:str]>`, `salaries<table[employee_id:int, salary:float]>`
- transforms: `<table -> table>`, e.g. `take<int table -> table>`
- pipelines: `<null -> table>`, e.g. `from<str null -> table>`
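As a loose analogy (not PRQL, and all names here are hypothetical), the signatures above can be sketched in Python, with a table as a list of row dicts, transforms as `table -> table` functions, and `from` as the atomic `null -> table` pipeline:

```python
from functools import reduce
from typing import Callable

# Hypothetical analogy: a table is a list of row dicts.
Table = list[dict]
Transform = Callable[[Table], Table]  # transform: table -> table

def from_(source: Table) -> Callable[[], Table]:
    """`from` is the atomic pipeline: signature null -> table."""
    return lambda: source

def take(n: int) -> Transform:
    """take<int table -> table>: applying the int yields a table -> table."""
    return lambda table: table[:n]

def pipeline(source: Callable[[], Table], *transforms: Transform) -> Table:
    """Thread the table from the atomic pipeline through each transform
    in order, mimicking PRQL's composition."""
    return reduce(lambda table, t: t(table), transforms, source())

employees = [{"employee_id": 1, "name": "a"},
             {"employee_id": 2, "name": "b"}]
result = pipeline(from_(employees), take(1))
```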
(We could argue about the `null` here. I initially didn't include the `str` in my signature. I just want to make it explicit that it doesn't act on a previous pipeline.)

Now one of the things that sets PRQL apart is the composability of its transforms, with transforms being composed simply with `\n` or `|`. As of the recent PR #1195, we can also put "pipelines" in lots of places, which further increases PRQL's capabilities here.

Being able to produce custom (and reusable) transforms in PRQL could really be one of the key features of the language. However, in practice I think we're not quite there yet, and in my analysis this comes down to how "namespaces" are handled in SQL. I will try to illustrate this with some examples. I'll largely make up some syntax.
It would be great to be able to generalise this, so let's say we could parameterise the pipeline definition in which case it actually becomes a transform
This looks reasonable and we might expect the following to work:
Because we know the previous pipeline always gets passed to the transform as the last parameter, we should be able to make the identification `s = salaries`.

However, what happens when we pass a pipeline with multiple source tables? It would not be clear which source table/namespace to take the `salary` column from. Maybe this is not an issue as long as the source pipeline fulfills the type restriction `table[salary:float]` of having a uniquely named column `salary` of type `float`.
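The last-parameter convention is essentially partial application. A hedged Python sketch (hypothetical names, not PRQL) of such a parameterised transform, and of the ambiguity once the incoming table only has prefixed `salary` columns:

```python
# Hypothetical sketch: a custom transform whose table parameter comes last.
Table = list[dict]

def add_bonus(pct: float, s: Table) -> Table:
    """Parameterised transform; because the input table `s` is the last
    parameter, fixing `pct` leaves a plain table -> table function."""
    return [{**row, "bonus": row["salary"] * pct} for row in s]

# Unambiguous: one source table with a unique `salary` column.
salaries = [{"employee_id": 1, "salary": 100.0}]
bonuses = add_bonus(0.5, salaries)

# Ambiguous: after joining two sources, only prefixed salary columns
# exist, so the transform's bare `salary` lookup has no referent.
joined = [{"employees.salary": 100.0, "contractors.salary": 80.0}]
try:
    add_bonus(0.5, joined)
except KeyError:
    print("ambiguous: no plain `salary` column to act on")
```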
Dataframe libraries like Pandas, which have a logical model closer to PRQL's, get around this by forcing unique column names at join time, using automatic renaming schemes like `lsuffix` and `rsuffix` (see e.g. `pandas.DataFrame.join`). SQL is more lenient and lets all columns exist in `tablename.`-prefixed namespaces.

This is as far as I've gotten in my explorations. I would welcome any thoughts, comments and hopefully discussion, as it's something I've been trying to express for a while but have only recently been able to write down in (hopefully) coherent form.
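For reference, the pandas-style suffix renaming mentioned above can be sketched without pandas itself; `join_with_suffixes` is a hypothetical stdlib-only helper, not a real API:

```python
# Hypothetical stdlib-only sketch of suffix renaming at join time.
Table = list[dict]

def join_with_suffixes(left: Table, right: Table, on: str,
                       lsuffix: str = "_l", rsuffix: str = "_r") -> Table:
    """Nested-loop equi-join that renames clashing columns with the given
    suffixes (in the spirit of pandas.DataFrame.join), instead of keeping
    SQL-style tablename.-prefixed namespaces."""
    clashes = (set(left[0]) & set(right[0])) - {on} if left and right else set()
    out = []
    for l in left:
        for r in right:
            if l[on] == r[on]:
                row = {on: l[on]}
                row.update({(k + lsuffix if k in clashes else k): v
                            for k, v in l.items() if k != on})
                row.update({(k + rsuffix if k in clashes else k): v
                            for k, v in r.items() if k != on})
                out.append(row)
    return out

employees = [{"employee_id": 1, "salary": 100.0}]
contractors = [{"employee_id": 1, "salary": 80.0}]
joined = join_with_suffixes(employees, contractors, on="employee_id")
# every column in `joined` is now uniquely named (salary_l, salary_r)
```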