Many ML workloads, such as LogisticRegression, generate and require as input datasets of the form RDD[LabeledPoint]. Converting back and forth between a weakly typed DataFrame and an RDD of LabeledPoint is no issue. The issue is getting datasets of LabeledPoints out to disk from the generators and back in, in a variety of formats.
The legacy version of Spark-Bench wrote datasets of LabeledPoints out as text files, so each row was a string that had to be parsed by the workload. This string parsing is a major hit to performance, particularly when formats like Parquet could be used to drastically cut down on storage space and transport time.
Spark-Bench needs a way to:
generically write DataFrames of LabeledPoint out to disk in a variety of formats
generically read datasets of LabeledPoint from disk in a variety of formats and convert them to DataFrames for use in workloads.
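A rough sketch of what such a generic writer/reader could look like, assuming a DataFrame with a `label` column and a `features` vector column (the object and method names here, `LabeledPointIO`, `saveLabeledData`, and `loadLabeledData`, are illustrative, not an actual Spark-Bench API):

```scala
// Hypothetical sketch of format-parameterized I/O for LabeledPoint datasets.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object LabeledPointIO {

  // Write a DataFrame (schema: label: Double, features: Vector) in the requested format.
  // Parquet preserves the vector column via its UDT; plain-text formats would need
  // the vector flattened or serialized first, which is exactly the parsing cost to avoid.
  def saveLabeledData(df: DataFrame, path: String, format: String): Unit =
    format match {
      case "parquet" => df.write.parquet(path)
      case "json"    => df.write.json(path)
      case other     => throw new IllegalArgumentException(s"Unsupported format: $other")
    }

  // Read the dataset back and convert to RDD[LabeledPoint] for MLlib workloads.
  def loadLabeledData(spark: SparkSession, path: String, format: String): RDD[LabeledPoint] = {
    val df = spark.read.format(format).load(path)
    df.rdd.map(row => LabeledPoint(row.getDouble(0), row.getAs[Vector](1)))
  }
}
```

The key design point is that the workload only ever sees RDD[LabeledPoint]; the on-disk format is an argument, so switching from text to Parquet is a configuration change rather than a code change.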