Windowing functions cause errors in lineage harvesting #262
The Spark code that I've extracted from the log above:

DROP TABLE IF EXISTS playground.test_set01_customer;
DROP TABLE IF EXISTS playground.test_set01_order;
DROP TABLE IF EXISTS playground.test_set01_item;
DROP TABLE IF EXISTS playground.test_set01_order_item;
DROP TABLE IF EXISTS playground.test_set01_country_aggregate;
CREATE TABLE IF NOT EXISTS playground.test_set01_customer (
id STRING NOT NULL,
name STRING NOT NULL,
email STRING NOT NULL,
city STRING,
state STRING,
country STRING,
catagory INT NOT NULL
) USING PARQUET;
CREATE TABLE IF NOT EXISTS playground.test_set01_order (
id STRING NOT NULL,
customer_id STRING NOT NULL,
ts TIMESTAMP NOT NULL
) USING PARQUET;
CREATE TABLE IF NOT EXISTS playground.test_set01_item (
id STRING NOT NULL,
name STRING NOT NULL,
cost FLOAT NOT NULL,
color STRING NOT NULL
) USING PARQUET;
CREATE TABLE IF NOT EXISTS playground.test_set01_order_item (
order_id STRING NOT NULL,
item_id STRING NOT NULL,
price FLOAT NOT NULL,
discount FLOAT COMMENT 'Order discount as floating percentage value (example: .16 would be a 16% discount)',
cost_rank INT COMMENT 'Cost ranking of the item at the time of order with the 1 being the most expensive item'
) USING PARQUET;
INSERT INTO playground.test_set01_customer VALUES (
'a925eaf2-e342-497b-85fb-773e81bf57f5',
'Michael Fline',
'test@test1.test',
'Santa Monica',
'CA',
'USA',
1
);
INSERT INTO playground.test_set01_customer VALUES (
'a92702b8-db3a-4aaa-b424-51e7ae30d185',
'Bobby Socks',
'test@test2.test',
'Portland',
'OR',
'USA',
2
);
INSERT INTO playground.test_set01_order VALUES (
'32b366fc-3644-4390-a02e-1fed6d658120',
'a925eaf2-e342-497b-85fb-773e81bf57f5',
current_timestamp
);
INSERT INTO playground.test_set01_order VALUES (
'32b366fc-3644-4390-a02e-1fed6d658121',
'a92702b8-db3a-4aaa-b424-51e7ae30d185',
current_timestamp
);
INSERT INTO playground.test_set01_item VALUES (
'42adb6d9-785d-47d3-8899-8eb0dbe1057c',
'Teemo Bath Mat',
17.25,
'Green'
);
INSERT INTO playground.test_set01_item VALUES (
'0630e2cf-59f7-4926-ae12-0c45b51baed3',
'Ahri Hand Soap',
1.15,
'Blue'
);
INSERT INTO playground.test_set01_item VALUES (
'5b560453-706e-48e6-a855-3af70447d3fa',
'Garen Tooth Brush', 0.75, 'Blue');
INSERT INTO playground.test_set01_item VALUES (
'79700b78-f820-4451-b4c2-79408f4b96e7',
'Darius Hair Brush', 2.25, 'Red');
spark.sql("""SELECT id, name, cost, color FROM playground.test_set01_item WHERE color = 'Blue'""").createOrReplaceTempView("test_set01_blue_item")
INSERT INTO playground.test_set01_order_item
SELECT o.id,
i.id,
i.cost * 1.1,
CASE WHEN c.catagory = 1 THEN .1 ELSE 0 END as discount,
0
FROM playground.test_set01_order o
INNER JOIN playground.test_set01_customer c ON (o.customer_id = c.id)
INNER JOIN test_set01_blue_item i ON (o.id = '32b366fc-3644-4390-a02e-1fed6d658120')
WITH test_set01_blue_item AS (
SELECT id, name, cost, color
FROM playground.test_set01_item
WHERE color = 'Blue'
)
INSERT INTO playground.test_set01_order_item
SELECT o.id, i.id, i.cost * 1.1, CASE WHEN c.catagory = 1 THEN .1 ELSE 0 END as discount, 0
FROM playground.test_set01_order o
INNER JOIN playground.test_set01_customer c ON (o.customer_id = c.id)
INNER JOIN test_set01_blue_item i ON (o.id = '32b366fc-3644-4390-a02e-1fed6d658120')
WITH test_set01_ranked_item AS (
SELECT id, name, cost, RANK() OVER (PARTITION BY color ORDER BY cost DESC) color_cost_rank
FROM playground.test_set01_item
)
INSERT INTO playground.test_set01_order_item
SELECT o.id, i.id, i.cost * 1.1, CASE WHEN c.catagory = 1 THEN .1 ELSE 0 END as discount, i.color_cost_rank
FROM playground.test_set01_order o
INNER JOIN playground.test_set01_customer c ON (o.customer_id = c.id)
INNER JOIN test_set01_ranked_item i ON (o.id = '32b366fc-3644-4390-a02e-1fed6d658120')
WHERE i.color_cost_rank = 1
spark.sql("""SELECT id, name, cost, RANK() OVER (PARTITION BY color ORDER BY cost DESC) color_cost_rank FROM playground.test_set01_item""").createOrReplaceTempView("test_set01_ranked_item")
INSERT INTO playground.test_set01_order_item
SELECT o.id, i.id, i.cost * 1.1, CASE WHEN c.catagory = 1 THEN .1 ELSE 0 END as discount, i.color_cost_rank
FROM playground.test_set01_order o
INNER JOIN playground.test_set01_customer c ON (o.customer_id = c.id)
INNER JOIN test_set01_ranked_item i ON (o.id = '32b366fc-3644-4390-a02e-1fed6d658120')
WHERE i.color_cost_rank = 1
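As an aside for readers unfamiliar with the `RANK() OVER (PARTITION BY color ORDER BY cost DESC)` expression used in the failing statements, here is a minimal pure-Python sketch (not part of the original notebook, and not Spark) that mimics the same ranking semantics on the item rows inserted above:

```python
from itertools import groupby

# Item rows from the repro above: (name, cost, color)
items = [
    ("Teemo Bath Mat", 17.25, "Green"),
    ("Ahri Hand Soap", 1.15, "Blue"),
    ("Garen Tooth Brush", 0.75, "Blue"),
    ("Darius Hair Brush", 2.25, "Red"),
]

def rank_by_color(rows):
    """Mimic RANK() OVER (PARTITION BY color ORDER BY cost DESC)."""
    ranked = {}
    # Partition by color, then order each partition by cost descending.
    for color, group in groupby(sorted(rows, key=lambda r: r[2]), key=lambda r: r[2]):
        ordered = sorted(group, key=lambda r: r[1], reverse=True)
        prev_cost, rank = None, 0
        for pos, (name, cost, _) in enumerate(ordered, start=1):
            if cost != prev_cost:  # RANK() leaves gaps after ties
                rank = pos
            ranked[name] = rank
            prev_cost = cost
    return ranked

ranks = rank_by_color(items)
# In the 'Blue' partition, 'Ahri Hand Soap' (1.15) ranks above
# 'Garen Tooth Brush' (0.75); single-row partitions rank 1.
```

So the `WHERE i.color_cost_rank = 1` filter keeps only the most expensive item of each color.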
Databricks Runtime 8.3 uses Apache Spark 3.1.1. I tested it on Spark 3.1.1 and 3.1.2 without any issues. I will probably have to test this on an actual Databricks instance to find the issue. Spark Scala code I used for testing:
|
@GolamRashed can you share the code to reproduce the issue please?
I can replicate this on Databricks 8.3.
What could be the solution for this? Should I contact Databricks or the Apache Spark dev team?
The problem is that Databricks changed some code in Spark that we rely on. Now we will have to implement special handling for that in Spline. This is further complicated by the fact that Databricks is closed source, so we can't just look at the code and see the difference. But we are working on it.
I have raised a ticket through Azure Databricks. Just to confirm, is the issue confined to
No, it's not a bug in Azure; they simply differ from Spark. I tried to explain it in the previous comment.
I understand what you mean. Azure is now contacting Databricks to find a solution for this. But my question is, is this issue related to
Spline Agent itself is
It would be awesome if Databricks folks contributed to Spline and added proper support for their closed-source stuff. Otherwise it will always be a trial-and-error approach. @GolamRashed, could you perhaps reference this issue in the Databricks ticket that you created?
Great thinking, it would be awesome if Databricks contributed! I have included the link to this thread in the support ticket. Spline could be a very good data lineage solution for the company.
Output from ObjectStructureDumper for Databricks Window
|
@wajda Tried it, same error persists -
|
Can you show the stack trace please?
This?
|
Thanks. Hm, I see that the fix doesn't actually address this error for some reason. I guess Adam faced another issue while trying to reproduce it and fixed that one, but not the original one. I can take a look at it. It would be great if you could assist me with testing, as the Databricks Community Cloud that I'm using seems to be down for the last couple of days - creating a cluster takes an hour and then fails.
It seemed to be working; there were no errors, and the related lineage was captured in the Spline UI. Does it mean the bug was in the Spline agent, not in Databricks? If yes, then I need to close the ticket with Azure/Databricks. I am more than happy to test Spline in Databricks for you guys!
Thanks Rashed!
Can you please detail the issue with specifics so that I can forward it to Databricks? I didn't quite get it from your explanation above.
This is a definition of the `Window` case class in open-source Spark:

```scala
case class Window(
    windowExpressions: Seq[NamedExpression], // <-- This one is different in the Databricks runtime
    partitionSpec: Seq[Expression],
    orderSpec: Seq[SortOrder],
    child: LogicalPlan) extends UnaryNode
```

But as you can see from the debug info here - #262 (comment) - the Databricks runtime provides an incompatible structure of the `windowExpressions` field. Without access to the source code it's difficult to reason about the semantics of those expressions. Likewise, there is no guarantee that this won't change silently in another Databricks version.
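To illustrate the kind of special handling a harvester needs when a vendor runtime reshapes a plan node (a language-agnostic sketch only - the real Spline agent is Scala, and every name below is hypothetical, not Spline's or Spark's API): instead of destructuring a node by one exact constructor shape, the extractor can probe known field variants and fail loudly on an unrecognized shape.

```python
# Hypothetical sketch: extract window expressions from a plan node whose
# concrete field layout may differ between Spark builds. All class and
# field names here are illustrative assumptions, not real Spark APIs.

class OssWindow:
    """Stand-in for the open-source node shape."""
    def __init__(self, windowExpressions):
        self.windowExpressions = windowExpressions

class VendorWindow:
    """Stand-in for a vendor build exposing a differently named field."""
    def __init__(self, windowExpressionList):
        self.windowExpressionList = windowExpressionList

def harvest_window_expressions(node):
    """Try each known field name in order rather than assuming one shape."""
    for field in ("windowExpressions", "windowExpressionList"):
        exprs = getattr(node, field, None)
        if exprs is not None:
            return list(exprs)
    # Surface an explicit error instead of silently dropping lineage.
    raise TypeError(f"Unrecognized Window node shape: {type(node).__name__}")
```

The trade-off is the one mentioned in the thread: without the vendor's source, the list of known shapes can only grow by trial and error, and a silent change in a new runtime version still breaks it.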
Thank you @wajda, I will forward this to Databricks. Just a request regarding the Spline UI: is it possible to add an option to have each lineage capture event be named after the table/file name that is written? Right now, it just says "Databricks Shell".
Update on my side. With the SNAPSHOT bundle provided above, I now see success capturing plans with windowing functions! It sounds like the above is still an issue due to the closed-source nature of Databricks' implementation of the windowing function. I'll celebrate success for the moment as this clears a significant hurdle for broad use. Thank you @wajda and team!
Thank you all guys, I'm closing the ticket as resolved then.
Not sure I get it. It also shows a target destination, both a short name and full path.
@wajda Databricks team replied through Azure support that they have opened an internal ticket for the engineering team to make the Spline library work with DBR. I hope this is good news for the Spline Dev team, and I will keep you posted with updates from Databricks.
@wajda I need this jar. Can you please give me a link that works?
@harishyadavdevops you are referring to an old development snapshot of Agent release 0.6.2. Can't you use a released version?
@harishyadavdevops your question doesn't have anything to do with the current ticket. Moreover, this ticket is closed, so your messages might simply be ignored by people. Please create separate tickets for separate issues. This time I'll create it for you and will move your messages there.
Does Spline not support XLSX files for showing lineage in the Spline UI?
Answered in #665.
Spline Team,
Testing some windowing functions in CTEs and temporary tables, we get errors while processing. I've attached my sample notebook. The first two insert commands work fine; the final two fail.
spline-test-windowing-function.ipynb.txt