Why is Spark still slower than MySQL?


I am trying to work with Apache Spark using MySQL as the data source. I have a cluster with 1 master and 1 slave node, both with 8 GB RAM and 2 cores. I am submitting a SQL query to Spark using spark-shell, and the table has 6402821 rows. I am performing a GROUP BY on the table. The time taken by MySQL is 5.2 seconds, while Spark takes 21 seconds to perform the same query. Why is this happening?

I am also setting the configurations partitionColumn, upperBound, lowerBound, and numPartitions, but there is still no change.

I have tried executing the query using 1, 2, and 4 cores, but the time taken by Spark is the same: 21 seconds.

Does this problem occur because my MySQL database is on a single machine, and all the Spark nodes try to query the data from that single machine?

Can anyone help me solve this issue?

The database has a table called demo_call_stats, and the query I am trying is:

val jdbcDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "zeroDateTimeBehavior" -> "convertToNull",
  "dbtable" -> "cmanalytics.demo_call_stats",
  "fetchsize" -> "10000",
  "partitionColumn" -> "newpartition",
  "lowerBound" -> "0",
  "upperBound" -> "4",
  "numPartitions" -> "4")).load()

jdbcDF.createOrReplaceTempView("call_stats")

val sqlDF = sql("select count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")

sqlDF.show()

Any help would be appreciated.

Thanks.

There are a couple of things you should understand here:

Despite what you might have heard, Spark isn't 'faster than MySQL', because that kind of generality doesn't mean anything. Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others. Generally speaking, MySQL is a relational database, meaning it has been conceived to serve as a backend for an application. It is optimized to access records efficiently as long as they are indexed.

When thinking about databases, I like to think of them as a library with one librarian who gets you the books you want (I am speaking of an old school library here, without any computer to help the librarian).

If you ask your librarian: "I want to know how many books you have about geopolitics", the librarian can go to the geopolitics shelf and count the number of books on that shelf.

If you ask your librarian: "I want to know how many books you have that have at least 500 pages", the librarian will have to look at every single book in the library to answer your query. In SQL this is called a full table scan. Of course you can have several librarians (processors) working on the query to go faster, but you cannot have more than a few of them (let's say 16) inside your library (computer).

Now, Spark has been designed to handle large volumes of data, namely libraries so big they won't fit into a single building, and even if one does, there are so many books that 16 librarians would take days to look at them all to answer your second query.

What makes Spark faster than MySQL is this: if you put your books in several buildings, you can have 16 librarians per building working on your answer. You can also handle a larger number of books.

Also, since Spark was made to answer the second type of query rather than queries like "please bring me 'The Portrait of Dorian Gray', by Oscar Wilde", it means that Spark doesn't care, at least by default, to sort the books in any particular way. That means that if you want to find one particular book with Spark, your librarians will have to go through the entire library to find it.

Of course, Spark uses many other types of optimisations to perform queries more efficiently, but indexation is not one of them. (If you are familiar with the notion of a primary key in MySQL, there is no such thing in Spark.) Other optimisations include storage formats like Parquet and ORC that allow you to read only the columns that are useful to answer your queries, and compression (e.g. Snappy), aimed at increasing the number of books you can fit in your library without having to push the walls.
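For instance, converting the JDBC DataFrame from the question into Snappy-compressed Parquet could look like the sketch below (the output path is made up for the example):

// A sketch only: write the table as columnar, Snappy-compressed Parquet.
// "/tmp/call_stats_parquet" is a hypothetical output path.
jdbcDF.write
  .format("parquet")
  .option("compression", "snappy")
  .save("/tmp/call_stats_parquet")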

I hope this metaphor helped you, but please bear in mind that it is just a metaphor and doesn't fit reality perfectly.

Now, regarding your question's specific details:

Assuming campaign_id is your primary key or you created an index on that column, MySQL only has to read the rows for which campaign_id = 77. On the other hand, Spark has to ask MySQL to send all the rows in the table to Spark. If Spark is clever, it will ask only for the ones with campaign_id = 77, and maybe it will send multiple queries to MySQL to fetch ranges in parallel. But that means all the data that MySQL could just read and aggregate locally has to be serialized, sent to Spark, and aggregated by Spark. I hope you see why this should take longer.
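To make that concrete, here is a rough sketch (not from the original answer) of what the JDBC source does with your settings partitionColumn = newpartition, lowerBound = 0, upperBound = 4, numPartitions = 4: it splits the range into strides of 1 and issues one query per partition, roughly:

select * from cmanalytics.demo_call_stats where newpartition < 1 or newpartition is null
select * from cmanalytics.demo_call_stats where newpartition >= 1 and newpartition < 2
select * from cmanalytics.demo_call_stats where newpartition >= 2 and newpartition < 3
select * from cmanalytics.demo_call_stats where newpartition >= 3

The filter campaign_id = 77 can usually be pushed down and appended to each of these where clauses, but every matching row still has to travel from MySQL to Spark before the aggregation runs.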

If you want Spark to answer your queries faster than MySQL, you should try copying your table into another format, like this:

// Replace this line:
// jdbcDF.createOrReplaceTempView("call_stats")
// with:
jdbcDF.write.format("orc").saveAsTable("call_stats")
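Once the table has been written with saveAsTable, the same query should read from the ORC files managed by Spark instead of round-tripping to MySQL, for example:

sql("select count(*), classification_id from call_stats where campaign_id = 77 group by classification_id").show()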

Another thing you could try is caching your data like this:

jdbcDF.cache().createOrReplaceTempView("call_stats")

Caching won't bring any improvement for the first query, since Spark will cache the data while performing it, but if you continue querying the same view, it might be faster. As explained above, though, this doesn't mean Spark will be faster than MySQL for everything.
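One way to see the effect (a sketch; exact timings will vary) is to run the same query twice:

val q = sql("select count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")
q.show() // first run reads from MySQL and fills the cache
q.show() // second run is served from the in-memory cache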

For small data and local deployments, you can also get a performance improvement by changing this configuration parameter: spark.sql.shuffle.partitions=4 (it is 200 by default).
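In the spark-shell you can set it like this:

spark.conf.set("spark.sql.shuffle.partitions", "4") // default is 200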

Hope this helps.

