scala - Load S3 files in parallel Spark


I am loading files into Spark from S3 with the following code. It works, but I am noticing a delay between one file and the next, and the files are loaded sequentially. I would like to improve this by loading them in parallel.

    // Load the files that were written by Firehose on this day
    var s3files = spark.sqlContext.read.schema(schema).json("s3n://" + job.awsaccesskey + ":" + job.awssecretkey + "@" + job.bucketname + "/" + job.awss3rawfileexpression + "/" + year + "/" + monthcheck + "/" + daycheck + "/*/").rdd

    // Apply the schema to the RDD; duplicates are still present here
    val usersdataframe = spark.createDataFrame(s3files, schema)

    usersdataframe.createOrReplaceTempView("results")

    // Clean the data, using the partition keys to eliminate duplicates and keep the latest record
    var results = spark.sql(buildcleaningquery(job, "results"))
    results.createOrReplaceTempView("filteredresults")
    val records = spark.sql("select count(*) from filteredresults")

I have also tried loading through the textFile() method, but then I have problems converting the RDD[String] to an RDD[Row], because afterwards I need to move on to Spark SQL. I am using it in the following manner:

    var s3files = sparkContext.textFile("s3n://" + job.awsaccesskey + ":" + job.awssecretkey + "@" + job.bucketname + "/" + job.awss3rawfileexpression + "/" + year + "/" + monthcheck + "/" + daycheck + "/*/").toJavaRDD()

What is the ideal way to load JSON files (multiple files of around 50 MB each) into Spark? I would like to validate the properties against a schema, so that later on I am able to run Spark SQL queries to clean the data.

What's going on is that the DataFrame is being converted into an RDD and then into a DataFrame again, which loses the partitioning information.

    // createOrReplaceTempView returns Unit, so there is nothing useful to assign
    spark
      .sqlContext
      .read.schema(schema)
      .json(...)
      .createOrReplaceTempView("results")

This should be sufficient, and the partitioning information should still be present, allowing the JSON files to be loaded concurrently.
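As a rough end-to-end illustration of that suggestion, here is a minimal sketch that reads JSON with an explicit schema and queries it through Spark SQL without ever dropping down to an RDD. The schema fields (`userId`, `updatedAt`), the object name, the default input path, and the `local[*]` master are all assumptions made for the example; they are not taken from the question, so substitute the real schema and S3 location.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object LoadJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-json-example")
      .master("local[*]") // assumption: local run; omit when submitting to a cluster
      .getOrCreate()

    // Hypothetical schema, standing in for the `schema` value used in the question
    val schema = StructType(Seq(
      StructField("userId", StringType, nullable = true),
      StructField("updatedAt", LongType, nullable = true)
    ))

    // Hypothetical path; pass the real location (for example the S3 glob) as the first argument
    val inputPath = args.headOption.getOrElse("/tmp/firehose/2017/01/01/*/")

    // Read straight into a DataFrame; no .rdd / createDataFrame round trip
    val users = spark.read.schema(schema).json(inputPath)

    // .rdd is used here only to inspect how many partitions the read produced
    println(s"input partitions: ${users.rdd.getNumPartitions}")

    users.createOrReplaceTempView("results")
    spark.sql("select count(*) from results").show()

    spark.stop()
  }
}
```

Printing the partition count is only a quick sanity check: if it comes out greater than one, the read has been split into tasks that Spark can schedule concurrently across the available cores or executors.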

