ExecutorLostFailure every time when trying to run on a folder
Hi, I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB memory and 3.2 GB disk. I am running the Spark task from one of the slave nodes (out of the 8 nodes), so with a storage fraction of 0.7 approximately 4.8 GB is available on each node, and I am using Mesos as the cluster manager. My configuration is:
spark.master                            mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled                  true
spark.driver.memory                     6g
spark.storage.memoryFraction            0.7
spark.core.connection.ack.wait.timeout  800
spark.akka.frameSize                    50
spark.rdd.compress                      true

I am trying to run the Spark MLlib Naive Bayes algorithm on a folder with around 14 GB of data (there is no issue when I run the task on a 6 GB folder). I read the folder from Google Storage as an RDD, giving 32 as the partition parameter (I have tried increasing the partitions as well). I then use TF to create feature vectors and predict on the basis of those. But whenever I try to run on this folder it throws an ExecutorLostFailure every time. I have tried different configurations but nothing is helping. I may be missing something basic but am not able to figure it out. Any suggestion would be highly valuable.
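For reference, the shape of the job is roughly the following. This is only a simplified sketch of what I am doing, not my exact script; get_label stands in for my real labelling logic and the app name is arbitrary:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

sc = SparkContext(appName="naive-bayes-job")

def get_label(path):
    # placeholder: the real script derives the class label from the file
    return 0.0

# read the folder from Google Storage as (filename, contents) pairs,
# asking for 32 partitions (I have tried higher values as well)
docs = sc.wholeTextFiles("gs://uc1f-bioinfocloud-vamp-m/literature/xml/p*/*.nxml", 32)

# build term-frequency feature vectors and train Naive Bayes on them
tf = HashingTF()
data = docs.map(lambda kv: LabeledPoint(get_label(kv[0]), tf.transform(kv[1].split())))
model = NaiveBayes.train(data)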
The log is:
15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/v2processrecords.py:213) failed in 28.966 s
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-s8 (epoch 3)
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-s8 from BlockManagerMaster.
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/v2processrecords.py:213, took 29.013646 s
Traceback (most recent call last):
  File "/opt/work/v2processrecords.py", line 213, in <module>
    secondpassrdd = firstpassrdd.map(lambda (name, title, idval, pmcid, pubdate, article, tags, author, ifsigmacust, wclass): (str(name), title, idval, pmcid, pubdate, article, tags, author, ifsigmacust, "yes" if ("pmc" + pmcid) in rddnihgrant else ("no"), wclass)).collect()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-s8 lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-s8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier: vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-s8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-s8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-s8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/v2processrecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/p*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-s8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-s8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}
This error occurs because a task failed more than 4 times. Try increasing the parallelism in your cluster using the following parameter:

--conf "spark.default.parallelism=100"

Set the parallelism value to 2 to 3 times the number of cores available in your cluster. If that doesn't work, try increasing the parallelism exponentially, i.e. if the current value doesn't work, multiply it by 2, and so on. I have also observed that it helps if your level of parallelism is a prime number, especially if you are using groupByKey.