Support lineage of Pandas.DataFrame #665
@harishyadavdevops this problem is most likely caused by the fact that you are reading your Excel file through the Pandas API, which is not directly supported by Spline. If you click the icon to open the detailed execution plan, you should see a fourth terminal node that represents your Excel data; it just isn't recognised as a read command (which is why you don't see it on the high-level lineage overview). Try reading the Excel file using the Spark Excel connector instead of Pandas. |
from pyspark.sql.functions import col  # needed for the col() references below

df_cmo_master = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(f"{input_filepath}/CMO_ERICA_AIM_SAP_Mapping_Master_Latest.xlsx") \
    .select(
        col("IDERICA"),
        col("TargetDays").alias("target_days"),
        col("PrimaryPlatformPlan").alias("plan_platfrom"),
        col("sitename").alias("cmo_site"),
        col("primaryplatform").alias("pes_platform"),
    ) \
    .distinct()
@wajda I have used the above code for reading the xlsx file, and it ran perfectly. But with this approach no lineage reaches the Spline UI; an empty screen is shown. |
By default the Spline agent only reacts on writing data to a persistent storage, i.e. df.write(), never on df.read(), df.show() etc. You can enable capturing memory-only actions if you want; it can be useful for debugging purposes: spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true |
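A minimal sketch of how that flag could be passed as Spark configuration (the producer URL and job script are placeholders, not taken from this thread; note that, per the agent's configuration convention, Spline properties set through Spark conf carry a spark. prefix):

```shell
# Sketch: enable the Spline listener and the non-persistent-actions plugin.
# The producer URL and my_job.py are placeholders.
spark-submit \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.producer.url=http://localhost:8080/producer" \
  --conf "spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true" \
  my_job.py
```

With the plugin enabled, memory-only actions such as df.show() should also produce execution events, which can help when debugging why no lineage appears.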
Hi Alex,
Greetings of the day!
I need to set up Spline with secure HTTPS. I have followed the steps at https://absaoss.github.io/spline/0.4.html but it didn't work. I request you to send me a document or links for setting up the Spline server with HTTPS on Ubuntu OS. I need this very badly. |
Spline is a web application. HTTPS is managed by the web server, not the application itself. For example, if you use Tomcat to run Spline, you have to set up Tomcat to support HTTPS: https://tomcat.apache.org/tomcat-9.0-doc/ssl-howto.html |
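Following the linked Tomcat SSL how-to, such a connector could be sketched in conf/server.xml roughly like this (the keystore path and password are placeholders, not part of the Spline documentation):

```xml
<!-- Sketch of an HTTPS connector for Tomcat's conf/server.xml.
     Keystore file and password below are placeholders. -->
<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true">
    <SSLHostConfig>
        <Certificate certificateKeystoreFile="conf/localhost-rsa.jks"
                     certificateKeystorePassword="changeit"
                     type="RSA" />
    </SSLHostConfig>
</Connector>
```

After restarting Tomcat, the Spline web apps deployed in it would be reachable over https on port 8443.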
I have secured it through an AWS load balancer; here are the URLs:
https://xxxxxxxx.xxxxxxxx.com:9443/producer
https://xxxxxxxx.xxxxxxxx.com:9443/consumer
I have passed the values below in the Databricks cluster, but the lineage is not showing up in the Spline UI. Can you guide me on this?

spark.spline.lineageDispatcher https
spark.spline.lineageDispatcher.https.producer.url https://xxxxxxxx.xxxxxxxx.com:9443/producer
spark.spline.mode ENABLED
spark.databricks.delta.preview.enabled true

My code in the Databricks notebook:

sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

# variables used to connect to the database
database = "Superstore"
table = "Superstore.dbo.SalesTransaction"
user = "hsbfhs"
password = "hshhfgsh"

# read table data into a Spark dataframe
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://xxxxxxxxx.com:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
jdbcDF.createOrReplaceTempView("jdbcDF")

# NOTE: jdbcDF_PM is referenced below, but its definition was not included in the message
sqlserver_ouput = spark.sql("""
    select jdbcDF.discount, jdbcDF.profit, jdbcDF.sales, jdbcDF.Quantity,
           jdbcDF_PM.pid, jdbcDF_PM.subid, jdbcDF_PM.catid
    from jdbcDF
    inner join jdbcDF_PM on (jdbcDF.productname == jdbcDF_PM.name)""")

# write the dataframe into a sql table
table = "siftdd"
user = "***@***.***"
password = "**************"
sqlserver_ouput.write.mode("append").saveAsTable(table)
|
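For reference, the commonly documented shape of these dispatcher settings uses the built-in http dispatcher name (a sketch with a placeholder URL; a custom dispatcher name like "https" above would normally also need its own className mapping in the agent configuration):

```properties
# Sketch: Spline agent dispatcher settings in their usual documented form.
# The producer URL is a placeholder; the "http" dispatcher posts to https URLs too.
spark.spline.lineageDispatcher=http
spark.spline.lineageDispatcher.http.producer.url=https://example.com:9443/producer
spark.spline.mode=ENABLED
```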
We try to help when possible, but we cannot spend time in meetings doing tech support. I cannot tell what is wrong from the code you provided, but I put together a troubleshooting guide. You can try to go through it and find the issue yourself. I hope it will help: AbsaOSS/spline#1225 Another thing: All messages you send to this ticket are public GitHub issues, so be sure not to share any sensitive data here. |
I have jobs in AWS Glue. The job ran successfully in Glue, but the lineage is not populated in Spline. |
It should be supported. But I think the discussion has already deviated far from the original topic. Please look through this - https://github.com/search?q=repo%3AAbsaOSS%2Fspline-spark-agent+glue&type=issues |
Hi Alex Vayda,
I stopped using Databricks for a while and will start using it again in Feb 2024. So I have a question on Spline again; could I please get a clarification?
1. After building the image and deploying it, I can see the UI is directly accessible. Does Spline support a user-authentication mechanism?
2. If Spline supports a user-based authentication mechanism, can you please send me an article on how to enable it?
Thank you in advance; looking forward to your reply. |
The short answer is: no, neither the UI nor the REST API has any auth mechanism built in. Likewise there is no notion of a "user" in the system; no on-boarding is required to start using it. The longer answer is the following. The intention for Spline was to create a simple core system that focuses on one thing only: lineage tracking. An authentication layer can be added on top of it, for example by putting a simple proxy in front of it that intercepts HTTP calls and performs authentication. This basically allows an all-or-nothing style of access control. If you need more granular access control, things become more complex and involved. Some simpler authorization use-cases could still be implemented at the proxy level by intercepting not only requests but also responses, filtering the content returned to the user. More sophisticated use-cases would definitely have to be implemented in the Spline core. It all depends on what exactly your requirements are. |
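Such an all-or-nothing proxy could be sketched, for example, with nginx and HTTP Basic auth (the hostname, certificate paths, backend port, and htpasswd file below are all placeholders, not part of the Spline documentation):

```nginx
# Sketch: nginx in front of the Spline server and UI,
# requiring Basic auth for every request (all-or-nothing access control).
server {
    listen 443 ssl;
    server_name spline.example.com;

    ssl_certificate     /etc/nginx/certs/spline.crt;
    ssl_certificate_key /etc/nginx/certs/spline.key;

    auth_basic           "Spline";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8080;  # Spline server behind the proxy
        proxy_set_header Host $host;
    }
}
```

Because every request (UI, Producer, and Consumer API alike) passes through the proxy, authenticated clients get full access and everyone else gets none, which matches the "all-or-nothing" style described above.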
Hey, I want to build the Spline image with my own Dockerfile, run.sh, and the other dependency files. I don't want to pull your Docker image every time I install it onto new VMs. Hence I request you to suggest a way to do this; it would be better if those files were shared. |
Hi, I am trying to execute the program from AWS Glue. The program executed, but the lineage is not visible in Spline.
Version of Spline agent:
s3://glue-479930578883-eu-west-2/lib/spark-3.3-spline-agent-bundle_2.12-2.0.0.jar
I am passing this parameter in Glue:
Key = --conf, Value = spark.spline.producer.url=http://18.117.242.93:8080/producer --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
Can you please also let me know if you support lineage capturing for Snowflake?
…On Tue, Jun 11, 2024, Alex Vayda wrote:
https://github.com/AbsaOSS/spline-getting-started/blob/main/building-docker.md |
Hi AbsaOSS/spline-spark-agent,
Good day!
Below is the error I am getting:

ERROR Inbox: Ignoring error
java.io.NotSerializableException: org.apache.spark.storage.StorageStatus
Serialization stack:
- object not serializable (class: org.apache.spark.storage.StorageStatus, value: ***@***.***)
- element of array (index: 0)
- array (class [Lorg.apache.spark.storage.StorageStatus;, size 10)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:160) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_412]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_412]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
24/08/07 10:14:26 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 172.34.0.225:37147 (size: 17.8 KiB, free: 5.8 GiB) |
@harishyadavdevops, first of all, regarding your questions:
Now, I have to close comments on this issue as it contains too many off-topic discussions. |
Originally posted by @harishyadavdevops in #262 (comment)