
XGBoost error code 255 #181

Closed

albertodema opened this issue Nov 12, 2018 · 32 comments

Comments


albertodema commented Nov 12, 2018

Hi, I am trying to use the new XGBoost support in master (latest commit d0785f0), but I am facing the following issue.
Here is the code (binary classification of the Titanic dataset = passengersData; the target column is Survived):

val (saleprice, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = targetColumn)
      
val featureVector = features.transmogrify()

val checkedFeatures = saleprice.sanityCheck(featureVector, removeBadFeatures = true)

val prediction = BinaryClassificationModelSelector.withCrossValidation(modelTypesToUse = Seq(
        OpXGBoostClassifier
      )).setInput(saleprice, checkedFeatures).getOutput()

val wf = new OpWorkflow()

val model = wf.setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures, prediction).train()

val results = "Model summary:\n" + model.summaryPretty()
println(results)

Attached are the log and the error:
logxg.txt


tovbinm commented Nov 12, 2018

@albertodema I don't see any explicit errors emitted except the error status. I would recommend asking on https://github.com/dmlc/xgboost/issues


albertodema commented Nov 13, 2018

@tovbinm thanks for your input, but they will likely ask me how TransmogrifAI is calling their module, with which parameters, etc.
The error happens both in the IntelliJ IDE and in a standalone Spark cluster on the Titanic dataset. Can you run a quick test with the code provided and, if you get the same error, share how XGBoost is being called so I can raise a proper request with the XGBoost team?
Thanks,
Alberto.


tovbinm commented Nov 14, 2018

@albertodema I think it might be related to this issue dmlc/xgboost#2449, since when we do cross validation we train multiple models in parallel. So I tried setting the parallelism to 1, and the error still happens sometimes. My bet is that there is some race condition that I am not sure how to track down yet.

@CodingCat might have some ideas?

@CodingCat

which version of xgb are you using?


tovbinm commented Nov 14, 2018

The latest - 0.81 with Spark 2.3.2.
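For readers reproducing this setup, the sbt coordinates for those versions would look roughly like the following. This is a hedged sketch, not taken from the thread; note that pre-1.0 xgboost4j artifacts were, to my knowledge, published without a Scala-version suffix, hence `%` rather than `%%` for them:

```scala
// Hedged sketch of build coordinates for the versions mentioned above.
libraryDependencies ++= Seq(
  "ml.dmlc" % "xgboost4j-spark" % "0.81",              // no Scala suffix pre-1.0
  "org.apache.spark" %% "spark-mllib" % "2.3.2" % Provided
)
```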

@CodingCat

Ok... we are supposed to have fixed this issue in 0.81, and I can actually run cross validation without any issue. Can you provide a way to reproduce it consistently?


albertodema commented Nov 15, 2018

Here is the code (use commit d0785f0; the input file, passed as the args(0) parameter, is here):
https://github.com/salesforce/TransmogrifAI/blob/master/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.{OpXGBoostClassifier}
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager}

/**
 * A minimal Titanic Survival example with TransmogrifAI
 */
object OpTitanicMini {

  case class Passenger
  (
    id: Long,
    survived: Double,
    pClass: Option[Long],
    name: Option[String],
    sex: Option[String],
    age: Option[Double],
    sibSp: Option[Long],
    parCh: Option[Long],
    ticket: Option[String],
    fare: Option[Double],
    cabin: Option[String],
    embarked: Option[String]
  )

  def main(args: Array[String]): Unit = {
    LogManager.getLogger("com.salesforce.op").setLevel(Level.ERROR)
    implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
    import spark.implicits._

    // Read Titanic data as a DataFrame
    val pathToData = Option(args(0))
    val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset().toDF()
   
    // Automated feature engineering
    val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
    val passengerId = features.find(_.name == "id").map(_.asInstanceOf[FeatureLike[Integral]]).get
    val featureVector = features.transmogrify()

    // Automated feature selection
    val checkedFeatures = survived.sanityCheck(featureVector, checkSample = 1.0, removeBadFeatures = true)

    // Automated model selection
    val prediction = BinaryClassificationModelSelector
      .withCrossValidation(modelTypesToUse = Seq(OpXGBoostClassifier))
      .setInput(survived, checkedFeatures).getOutput()
    val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures, prediction).train()

    println("Model summary:\n" + model.summaryPretty())
  }

}

@CodingCat

@albertodema I will start looking into this...where did you run this, a laptop or a cluster?

@albertodema

@CodingCat on a laptop, first in IntelliJ and then inside a docker container. I also tried launching Spark in single-core mode, but the error happens all the same.


tovbinm commented Nov 22, 2018

Here is how to reproduce. I train 10 xgboost models in parallel and it fails:

val sparse = RandomVector.sparse(RandomReal.uniform[Real](), 1000).take(10000)
val labels = RandomBinary(0.5).withProbabilityOfEmpty(0.0).take(10000).map(b => b.toDouble.toRealNN(0.0))
val sample = sparse.zip(labels).toSeq
val (data, features, label) = TestFeatureBuilder(sample)

(1 to 10).par.map { _ =>
  val x = new XGBoostClassifier().setLabelCol(label.name).setFeaturesCol(features.name)
  x.set(x.trackerConf, TrackerConf(0L, "scala"))
  val xm = x.fit(data)
  val xtransformed = xm.transform(data)
  xtransformed.show()
}

Error:

ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:364)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
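For context on the repro above: `.par` converts the range into a Scala parallel collection, so the 10 training closures run concurrently on a fork-join pool, and a single failing closure makes the whole parallel map throw. A Spark-free sketch of the same pattern (assuming Scala 2.11/2.12, where `.par` lives in the standard library; the object name is illustrative only):

```scala
// Illustration only -- no Spark or XGBoost involved. Each closure passed to
// .par.map may run on a different thread, which is the same mechanism the
// repro uses to fit 10 models concurrently and surface the tracker race.
object ParDemo {
  def main(args: Array[String]): Unit = {
    val squares = (1 to 10).par.map(i => i * i).toList.sorted
    println(squares.mkString(","))  // 1,4,9,16,25,36,49,64,81,100
  }
}
```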

@CodingCat

so it only happens with parallel model training?


tovbinm commented Nov 22, 2018

With parallel execution it is consistently reproducible. Sometimes it also comes up when training multiple models sequentially, but that is rather rare.

@zhenchuan

Is this issue still being followed up? I also encountered the same problem.


tovbinm commented Feb 14, 2019

Yes, we are aware of the problem, but we have not been able to track down its cause yet. Perhaps you want to look into it? @zhenchuan this would be a very valuable contribution :)


timsetsfire commented Feb 22, 2019

Is it possibly related to this dmlc/xgboost#4054?

I had instances where it would sometimes work and sometimes wouldn't (within TransmogrifAI). So I went to plain vanilla xgboost-spark and found the same thing (in both straight model training and cross validation). Training would fail, and then there would be an issue with dead letters.


tovbinm commented Mar 5, 2019

@timsetsfire thanks. I will give it a try. Also xgboost 0.82 is out and might have some related fixes 🤞


tovbinm commented Mar 5, 2019

The same error persists with xgboost 0.82.

Here is another error of the same type: dmlc/xgboost#3418

@CodingCat any suggestions on how to overcome it?


CodingCat commented Mar 6, 2019

are you actually using the Scala version of the Rabit tracker?


CodingCat commented Mar 6, 2019

@chenqin


tovbinm commented Mar 6, 2019

Yes, it fails with the Scala tracker (the Python implementation of the Rabit tracker on Databricks works great).

@CodingCat

ah... the Scala tracker... it has been out of maintenance for a while...


tovbinm commented Mar 7, 2019

Thanks @wsuchy and @CodingCat


tovbinm commented Mar 7, 2019

We will update the project with the upcoming 0.83 (once available).
@albertodema for now please use TrackerConf(0L, "python") and follow the instructions on the XGBoost project page on how to set up the Python RabitTracker.
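A sketch of that workaround, mirroring the repro snippet earlier in this thread. Only the tracker argument changes from "scala" to "python"; the column names below are placeholders, and this is a hedged sketch rather than a verified configuration (it is a build-time config fragment, not runnable without a Spark cluster and a Python interpreter on every executor node):

```scala
// Hedged sketch of the workaround, based on the xgboost4j-spark 0.8x API used
// above: TrackerConf(0L, "python") selects the Python Rabit tracker (0L = no
// timeout) instead of the unmaintained Scala tracker.
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

val x = new XGBoostClassifier()
  .setLabelCol("label")        // placeholder column names
  .setFeaturesCol("features")
x.set(x.trackerConf, TrackerConf(0L, "python"))
```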

tovbinm closed this as completed Mar 7, 2019
tovbinm mentioned this issue Jul 11, 2019

shenzgang commented Sep 25, 2019

Hello, I also encountered the same problem. I use Spark 2.3.2 with xgboost4j-spark 0.90, and it throws a model training failure (ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed).

Container id: container_e03_1568625988058_7223_01_000004
Exit code: 255
Stack trace: ExitCodeException exitCode=255: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2019-09-25 10:31:13 [WARN] Model OpXGBoostClassifier attempted in model selector with failed with following issue: 
com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$1.applyOrElse(OpValidator.scala:326)
org.apache.spark.SparkException: Job 45 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1848)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1761)
	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
	at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1372)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1371)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:352)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
	at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:99)
	at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:67)
	at org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
	at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:321)
	at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:320)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

My code is as follows:

val (response, feature) = FeatureBuilder.fromDataFrame[RealNN](frame, label)
println(s"response = ${response}")
val features = feature.dropWhile { case x => x.name == id }
println("============== opFeatures ==============")
features.foreach(println(_))

val transmogrifyFeature = features.transmogrify()
val checkedFeature = response.sanityCheck(transmogrifyFeature, removeBadFeatures = true)

val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpXGBoostClassifier)
).setInput(response, checkedFeature).getOutput()

val evaluator = Evaluators.BinaryClassification().setLabelCol(label).setPredictionCol(prediction)
val workflow = new OpWorkflow().setInputDataset(frame, (row: Row) => row.get(0).toString).setResultFeatures(prediction)
println("============ training ===========")
val model = workflow.train()
println(s"Model Summary:\n ${model.summaryPretty()}")

Thank you for reading; I look forward to your reply!


tovbinm commented Sep 25, 2019

@zhenchuan which TransmogrifAI version are you using?

@shenzgang

@tovbinm Hello, I am using version 0.60


tovbinm commented Sep 25, 2019

@shenzgang The XGBoost fix for this issue comes with this PR - #402. So you can either compile a local version of TransmogrifAI by pulling the repo, checking out the branch revert-399-mt/revert-spark-2.4, and then running ./gradlew publishToMavenLocal.

Or you can wait until we release the next version of TransmogrifAI. Perhaps @gerashegalov or @leahmcguire can comment on when exactly.
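The local-build route above can be sketched as the following command sequence (branch and Gradle task names are taken from the comment, not re-verified):

```shell
# Hedged sketch: build TransmogrifAI locally from the fix branch and publish
# the artifacts to the local Maven repository (~/.m2).
git clone https://github.com/salesforce/TransmogrifAI.git
cd TransmogrifAI
git checkout revert-399-mt/revert-spark-2.4
./gradlew publishToMavenLocal
```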

@shenzgang

@tovbinm Also, when I use OpNaiveBayes with cross-validation for binary classification, it throws the following exception:
Caused by: java.lang.IllegalArgumentException: requirement failed: Naive Bayes requires nonnegative feature values but found (2078,[0,1,2,3,4,6,7,8,10,11,12,13,15,21,26,30,118,460,599,1021,1022,1348,1356,1393,1948],[0.2588190451025203,-0.9659258262890684,-0.22252093395631434,0.9749279121818236,1.0,-0.861701759948068,0.5074150932938458,1205.0,4.0,6.0,33.0,76.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.classification.NaiveBayes$.requireNonnegativeValues(NaiveBayes.scala:232)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:165)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:163)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1$$anonfun$apply$6.apply(PairRDDFunctions.scala:172)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:144)
at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
Some data sets hit this error. When OpNaiveBayes training fails, will you try the other algorithms and continue training, or throw an exception?

My code:

modelTypesToUse = Seq(OpLogisticRegression,
  OpRandomForestClassifier,
  OpGBTClassifier,
  OpLinearSVC,
  OpNaiveBayes,
  OpDecisionTreeClassifier)
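For context on the exception above: Spark ML's NaiveBayes validates every feature vector with a nonnegativity requirement, and transmogrified features can legitimately contain negative values (the failing vector includes -0.9659..., plausibly from a circular date/time encoding). A Spark-free sketch of that check; the helper name below is hypothetical, a stand-in for Spark's internal `NaiveBayes.requireNonnegativeValues`:

```scala
// Hypothetical stand-in for the check in the stack trace above: it rejects
// any vector containing a negative entry with a "requirement failed" error.
object NonnegativeCheck {
  def requireNonnegative(values: Seq[Double]): Unit =
    require(values.forall(_ >= 0.0),
      s"Naive Bayes requires nonnegative feature values but found ${values.mkString("[", ",", "]")}")

  def main(args: Array[String]): Unit = {
    // A slice of the failing vector from the exception above:
    val failing = Seq(0.2588190451025203, -0.9659258262890684, 1205.0, 4.0)
    println(scala.util.Try(requireNonnegative(failing)).isSuccess)  // false
  }
}
```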


tovbinm commented Sep 25, 2019

I think this might have been fixed in this PR - #404

Try using TransmogrifAI 0.6.1 release

@shenzgang

Ok, thanks! I'll keep following transmogrifai!

@leahmcguire (Collaborator)

TransmogrifAI 0.6.1 was released 2 weeks ago. Are you asking when we will release with the updated spark version?
