Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] The default behavior of PPL Lookup does not work as expected #1026

Closed
penghuo opened this issue Jan 31, 2025 · 3 comments · Fixed by #1035
Closed

[BUG] The default behavior of PPL Lookup does not work as expected #1026

penghuo opened this issue Jan 31, 2025 · 3 comments · Fixed by #1035
Assignees
Labels
bug Something isn't working

Comments

@penghuo
Copy link
Collaborator

penghuo commented Jan 31, 2025

What is the bug?

  • issue-1, expect source = $sourceTable | LOOKUP $lookupTbl id should replace id column , but the result include 2 id columns. IMO, it should work as same as default behaviour source = $sourceTable | LOOKUP $lookupTbl id REPLACE id, department.
+----+-----+----------+-------+------+----+----------+
|  id| name|occupation|country|salary|  id|department|
+----+-----+----------+-------+------+----+----------+
|1000| Jake|  Engineer|England|100000|1000|        IT|
|1001|Hello|    Artist|    USA| 70000|NULL|      NULL|
|1002| John|    Doctor| Canada|120000|1002|      DATA|
|1003|David|    Doctor|   NULL|120000|1003|        HR|
|1004|David|      NULL| Canada|     0|NULL|      NULL|
|1005| Jane| Scientist| Canada| 90000|1005|      DATA|
+----+-----+----------+-------+------+----+----------+
  • issue-2, got exception when execute source = $sourceTable | LOOKUP $lookupTbl id REPLACE id, department
[AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`__auto_generated_subquery_name_l`.`id`, `__auto_generated_subquery_name_s`.`id`].
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`__auto_generated_subquery_name_l`.`id`, `__auto_generated_subquery_name_s`.`id`].

How can one reproduce the bug?

  • Add following code to FlintSparkPPLLookupITSuite
  protected def sourceTable(testTable: String): Unit = {
    sql(s"""
           | CREATE TABLE $testTable
           | (
           |   id INT,
           |   name STRING,
           |   occupation STRING,
           |   country STRING,
           |   salary INT
           | )
           | USING $tableType $tableOptions
           |""".stripMargin)

    // Insert data into the new table
    sql(s"""
           | INSERT INTO $testTable
           | VALUES (1000, 'Jake', 'Engineer', 'England' , 100000),
           |        (1001, 'Hello', 'Artist', 'USA', 70000),
           |        (1002, 'John', 'Doctor', 'Canada', 120000),
           |        (1003, 'David', 'Doctor', null, 120000),
           |        (1004, 'David', null, 'Canada', 0),
           |        (1005, 'Jane', 'Scientist', 'Canada', 90000)
           | """.stripMargin)
  }

  protected def lookupTbl(testTable: String): Unit = {
    sql(s"""
           | CREATE TABLE $testTable
           | (
           |   id INT,
           |   department STRING
           | )
           | USING $tableType $tableOptions
           |""".stripMargin)

    // Insert data into the new table
    sql(s"""
           | INSERT INTO $testTable
           | VALUES (1000, 'IT'),
           |        (1002, 'DATA'),
           |        (1003, 'HR'),
           |        (1005, 'DATA'),
           |        (1006, 'SALES')
           | """.stripMargin)
  }



  test("test LOOKUP lookupTable") {
    var frame = sql(s"source = $sourceTable | LOOKUP $lookupTbl id")
    frame.show()

    frame = sql(s"source = $sourceTable | LOOKUP $lookupTbl id REPLACE id, department")
    frame.show()
  }

What is the expected behavior?

  • source = $sourceTable | LOOKUP $lookupTbl id
+----+-----+----------+-------+------+----------+
|  id| name|occupation|country|salary|department|
+----+-----+----------+-------+------+----------+
|1000| Jake|  Engineer|England|100000|        IT|
|1001|Hello|    Artist|    USA| 70000|      NULL|
|1002| John|    Doctor| Canada|120000|      DATA|
|1003|David|    Doctor|   NULL|120000|        HR|
|1004|David|      NULL| Canada|     0|      NULL|
|1005| Jane| Scientist| Canada| 90000|      DATA|
+----+-----+----------+-------+------+----------+

What is your host/environment?

  • OS: [e.g. iOS]
  • Version [e.g. 22]: main branch
  • Plugins

Do you have any screenshots?
If applicable, add screenshots to help explain your problem.

Do you have any additional context?
Add any other context about the problem.

@penghuo penghuo added bug Something isn't working untriaged labels Jan 31, 2025
@penghuo
Copy link
Collaborator Author

penghuo commented Feb 4, 2025

@LantaoJin @YANG-DB could you take a look?

@LantaoJin
Copy link
Member

LantaoJin commented Feb 5, 2025

issue-1:

I think you are right, the current behaviour of inputField is "You can specify multiple <inputField> with comma-delimited. If you don't specify any <inputField>, all fields of <lookupIndex> where matched values are applied to result output."

The correct behaviour should be "If you don't specify any <inputField>, all fields of <lookupIndex> that are not the match fields where matched values are applied to result output."

@LantaoJin
Copy link
Member

issue-2:

Seems a known bug. I have a IT "test LOOKUP lookupTable name REPLACE occupation" with workaround. Let me check how to fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants