Our pipeline ran successfully with Spark 3.1.1 + XGBoost 1.1.1 in production. After upgrading to Spark 3.5.0, we tested multiple XGBoost versions (2.1.0-2.1.4) and consistently encountered the same Rabit tracker connection error during distributed training.
Error Description
Failure occurs when initializing distributed training:
ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:
[tracker.cc:286|12:58:58]: Failed to accept connection.
[socket.h:89|12:58:58]: Invalid polling request.
The full stack trace shows the error originates in RabitTracker.stop() after the connection is rejected.
Reproduction Steps
Submit command:
spark-submit --master yarn --deploy-mode cluster ...
Code:
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// Assemble the 23 raw feature columns (f1 .. f23) into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols((1 to 23).map(i => s"f$i").toArray)
  .setOutputCol("features")

// Index the string label column "y" into numeric labels, skipping invalid rows.
val labelIndexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("indexedLabel")
  .setHandleInvalid("skip")
  .fit(training)

val booster = new XGBoostClassifier(
  Map(
    "eta" -> 0.1f,
    "max_depth" -> 5,
    "objective" -> "multi:softprob",
    "num_class" -> 2,
    "device" -> "cpu"
  )
).setNumRound(10).setNumWorkers(2) // two distributed workers; both must be schedulable at once
booster.setFeaturesCol("features")
booster.setLabelCol("indexedLabel")

// Map numeric predictions back to the original label strings.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("convertedPrediction")
  .setLabels(labelIndexer.labelsArray(0))

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, booster, converter))

println("ready to train...")
val model: PipelineModel = pipeline.fit(training) // fails here
Attempted Fixes
✅ Verified compatibility between Spark 3.5.0 and XGBoost 2.1.x
✅ Tested every patch release in the XGBoost 2.1.x series (2.1.0 through 2.1.4)
❌ Adjusting the tracker ports (tracker_conf) had no effect
❌ Increasing the tracker timeout (timeout parameter) had no effect
Key Questions
Is this a known issue with Spark 3.5.0’s network layer and XGBoost 2.1.x?
Are there specific configurations required for XGBoost 2.1.x + Spark 3.5.0?
Should we downgrade to Spark 3.1.x or wait for an XGBoost patch?
Full stack trace:
25/03/31 12:58:58 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:
[tracker.cc:286|12:58:58]: Failed to accept connection.
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.RabitTracker.stop(RabitTracker.java:84)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.withTracker(XGBoost.scala:467)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:501)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:210)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
at Test$.main(Test.scala:59)
at Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:738)
We have been running tests with Spark 3.5 but haven't observed a similar error yet. The errors come from polling a TCP socket, and I can't guess the cause from the available information alone. Is there a way we can reproduce your networking environment?
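As a first check, a bare TCP round trip between the driver host and a worker host would take the Rabit tracker out of the picture entirely; this is a minimal sketch, and the host name and port are placeholders:

import java.net.{ServerSocket, Socket}

// On the driver host: listen on an OS-assigned free port and accept one connection.
val server = new ServerSocket(0)
println(s"listening on port ${server.getLocalPort}")
val conn = server.accept() // blocks until the worker-side probe connects
println(s"accepted ${conn.getRemoteSocketAddress}")
conn.close()
server.close()

// On a worker host (replace "driver-host" and 12345 with the values printed above):
// val probe = new Socket("driver-host", 12345)
// println(s"connected: ${probe.isConnected}")
// probe.close()

If the raw connection is also rejected, the cause is more likely firewall rules or host resolution in the cluster than anything in XGBoost.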
While debugging the Spark-XGBoost pipeline, I encountered a warning that had never appeared in our previous environment:
Warning Log:
WARN DAGScheduler: Creating new stage failed due to exception - job: 2
org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
Resolution: Disabling dynamic resource allocation resolved this conflict:
import org.apache.spark.sql.SparkSession

// Barrier execution mode (used by distributed XGBoost training) does not
// support dynamic resource allocation, so it has to be disabled explicitly.
val sparkSession = SparkSession
  .builder()
  .appName("xgboostTest")
  .enableHiveSupport()
  .config("spark.dynamicAllocation.enabled", "false")
  .getOrCreate()
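For cluster-mode jobs, the same switch can be set at submit time instead of in code:

spark-submit --conf spark.dynamicAllocation.enabled=false --master yarn --deploy-mode cluster ...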
Subsequently, a stricter data validation error emerged during training:
Error Log:
ERROR DataBatch: java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format. If you didn't use Spark's VectorAssembler class to build your feature vector but instead did so in a way that preserves zeros in your feature vector you can avoid this check by using the 'allow_non_zero_for_missing parameter' (only use if you know what you are doing)
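A hedged sketch of the two workarounds the message itself points at (the setter and parameter names are assumptions against the 2.x API; verify them for your version): either declare 0.0 as the missing value so it agrees with the zeros that VectorAssembler drops when it emits SparseVectors, or opt out of the check when zeros are genuine feature values.

// Option 1 (assumes setMissing is available on this XGBoostClassifier version):
// treat 0.0 as missing, matching VectorAssembler's sparse output.
booster.setMissing(0.0f)

// Option 2 (parameter name taken from the error message above; only use it if
// zeros in your feature vectors are real values, not placeholders for missing):
// new XGBoostClassifier(Map("allow_non_zero_for_missing" -> true /* , ... */))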