scala - Load S3 files in parallel Spark

August 15, 2015

I am loading files into Spark from S3 through the following code. It is working, but I am noticing that there is a delay between one file and another, and that they are loaded sequentially. I would like to improve this by loading them in parallel.

        // Load the files that were loaded by Firehose on this day
        var s3files = spark.sqlContext.read.schema(schema)
          .json("s3n://" + job.awsaccesskey + ":" + job.awssecretkey + "@" + job.bucketname + "/" +
            job.awss3rawfileexpression + "/" + year + "/" + monthcheck + "/" + daycheck + "/*/")
          .rdd

        // Apply the schema to the RDD; here we will have duplicates
        val usersdataframe = spark.createDataFrame(s3files, schema)

        usersdataframe.createOrReplaceTempView("results")

        // Clean and use the partition keys to eliminate duplicates and keep the latest record
        var results = spark.sql(buildcleaningquery(job, "results"))
        results.createOrReplaceTempView("filteredresults")
        val records = spark.sql("select count(*) from filteredresults")

I have also tried loading through the textFile() method, but then I am having problems converting from RDD[String] to RDD[Row], because afterwards I need to move on to using Spark SQL. I am using it in the following manner:

        var s3files = sparkContext
          .textFile("s3n://" + job.awsaccesskey + ":" + job.awssecretkey + "@" + job.bucketname + "/" +
            job.awss3rawfileexpression + "/" + year + "/" + monthcheck + "/" + daycheck + "/*/")
          .toJavaRDD()

What is the ideal manner to load JSON files (multiple files of around 50 MB each) into Spark? I would like to validate the properties against a schema, so that later on I am able to run Spark SQL queries to clean the data.
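On the RDD[String] to RDD[Row] conversion mentioned above, one possible route is to hand the lines to the JSON reader rather than parsing them by hand. The following is only a minimal sketch, assuming Spark 2.2 or later (where `DataFrameReader.json` accepts a `Dataset[String]`). The object name `TextToRows`, the two-field schema, and the sample lines are hypothetical stand-ins for the real data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object TextToRows {
  // Hypothetical schema; the real one is whatever `schema` holds in the question.
  val schema: StructType = StructType(Seq(
    StructField("userId", StringType, nullable = true),
    StructField("updatedAt", LongType, nullable = true)
  ))

  // Turn an RDD of JSON lines into a DataFrame that follows the schema.
  def parse(spark: SparkSession, lines: RDD[String]): DataFrame = {
    val ds = spark.createDataset(lines)(Encoders.STRING)
    spark.read.schema(schema).json(ds)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("text-to-rows")
      .getOrCreate()

    // Stand-in for sparkContext.textFile(...): two JSON lines in two partitions.
    val lines = spark.sparkContext.parallelize(Seq(
      """{"userId":"u1","updatedAt":1}""",
      """{"userId":"u2","updatedAt":2}"""
    ), numSlices = 2)

    val df = parse(spark, lines)
    val rows: RDD[Row] = df.rdd // available if an RDD[Row] is really needed
    df.createOrReplaceTempView("results")
    spark.sql("select userId, updatedAt from results").show()
    println(s"row RDD partitions: ${rows.getNumPartitions}")

    spark.stop()
  }
}
```

With this approach the DataFrame is what Spark SQL works on, so the RDD[Row] step is usually only needed when some other API requires it.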

What's going on is that the DataFrame is being converted into an RDD and then into a DataFrame again, which loses the partitioning information.

        var s3files = spark
          .sqlContext
          .read.schema(schema)
          .json(...)
          .createOrReplaceTempView("results")

This should be sufficient, and the partitioning information should still be present, allowing the JSON files to be loaded concurrently.
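The suggestion above can be tried end to end with a small local sketch. It is an illustration under assumptions, not the original job: the object name `DirectJsonRead`, the two-field schema, and the temporary local directory standing in for the S3 prefix are all hypothetical. It reads the JSON files straight into a DataFrame, registers the view, and prints the number of input partitions so the degree of parallelism can be inspected.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object DirectJsonRead {
  // Hypothetical schema; the real one is whatever `schema` holds in the question.
  val schema: StructType = StructType(Seq(
    StructField("userId", StringType, nullable = true),
    StructField("updatedAt", LongType, nullable = true)
  ))

  // Read JSON files under `path` directly into a DataFrame, with no RDD round trip.
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read.schema(schema).json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("direct-json-read")
      .getOrCreate()

    // Local stand-in for the S3 prefix: four small JSON-lines files.
    val dir = Files.createTempDirectory("raw-json")
    (1 to 4).foreach { i =>
      val line = s"""{"userId":"u$i","updatedAt":$i}""" + "\n"
      Files.write(dir.resolve(s"part-$i.json"), line.getBytes(StandardCharsets.UTF_8))
    }

    val users = load(spark, dir.toString)
    users.createOrReplaceTempView("results")

    // The partition count shows how many tasks can read the input concurrently.
    println(s"input partitions: ${users.rdd.getNumPartitions}")
    val count = spark.sql("select count(*) from results").first().getLong(0)
    println(s"records: $count")

    spark.stop()
  }
}
```

The same `load` call should work with an S3 path in place of the temporary directory, provided the cluster is configured for S3 access. The partition count it reports depends on file sizes and Spark's file-splitting settings.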
