maven - Importing an LZO file into Java Spark as a Dataset -


I have data in TSV format, compressed using LZO. Now I want to use these data in a Java Spark program.

At the moment, I am able to decompress the files and import them into Java as text files using

    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("myName")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if the data was imported correctly

where I have passed the path to the decompressed file as the first argument. If I pass the LZO file as the argument instead, the result of show is illegible garbage.

Is there a way to make this work? I use the IntelliJ IDE and the project is set up with Maven.

I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code can remain the same as in the question, provided one is OK with the LZO file being imported into a single partition.

In the following I explain how to do this for a Maven project set up in IntelliJ.

  • Installing the hadoop-lzo package: you need to modify the pom.xml file in the Maven project folder. It should contain the following excerpt:

    <repositories>
        <repository>
            <id>twitter-twttr</id>
            <url>http://maven.twttr.com</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>

        <dependency>
            <!-- Apache Spark main library -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <!-- Packages for Datasets and DataFrames -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>

    </dependencies>

This activates the Twitter Maven repository, which contains the hadoop-lzo package, and makes hadoop-lzo available to the project.

  • The second step is to create a core-site.xml file to tell Hadoop that you have installed a new codec. It should be placed somewhere in the program folders. I put it under src/main/resources/core-site.xml and marked the folder as a resource (right click on the folder in the IntelliJ project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

    <configuration>
        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.DefaultCodec,
                com.hadoop.compression.lzo.LzoCodec,
                com.hadoop.compression.lzo.LzopCodec,
                org.apache.hadoop.io.compress.GzipCodec,
                org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        <property>
            <name>io.compression.codec.lzo.class</name>
            <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
    </configuration>
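If you would rather not ship a core-site.xml resource, the same two properties can also be set programmatically on the live Hadoop configuration before reading the file. This is a hedged sketch, not the method from the answer; the class name LzoCodecSetup is my own, and the codec values simply mirror the XML above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class LzoCodecSetup {

    // Same comma-separated codec list as in core-site.xml above.
    static final String CODECS =
            "org.apache.hadoop.io.compress.DefaultCodec,"
          + "com.hadoop.compression.lzo.LzoCodec,"
          + "com.hadoop.compression.lzo.LzopCodec,"
          + "org.apache.hadoop.io.compress.GzipCodec,"
          + "org.apache.hadoop.io.compress.BZip2Codec";

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("LzoCodecSetup")
                .getOrCreate();

        // Apply the same settings core-site.xml would provide.
        Configuration conf = spark.sparkContext().hadoopConfiguration();
        conf.set("io.compression.codecs", CODECS);
        conf.set("io.compression.codec.lzo.class",
                 "com.hadoop.compression.lzo.LzoCodec");

        spark.stop();
    }
}
```

Either approach should work; the core-site.xml route has the advantage that the settings travel with the packaged resources rather than living in application code.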

And that's it! Run the program again and it should work!
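Putting the two steps together, a minimal end-to-end sketch might look like the following. This assumes the pom.xml and core-site.xml above are in place; the class name LzoTsvReader and the path "data.tsv.lzo" are hypothetical placeholders for your own:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LzoTsvReader {
    public static void main(String[] args) {
        // Local SparkSession; the hadoop-lzo codec is picked up from
        // core-site.xml on the classpath (src/main/resources).
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("LzoTsvReader")
                .getOrCreate();

        // Read the LZO-compressed TSV directly; Hadoop decompresses it
        // transparently once the codec is registered.
        Dataset<Row> input = spark.read()
                .option("sep", "\t")
                .csv("data.tsv.lzo");

        input.show(5);   // visually check that the data decompressed correctly
        spark.stop();
    }
}
```

Note that, as mentioned above, a plain .lzo file is not splittable, so it will land in a single partition; repartitioning after the read is one way to restore parallelism if needed.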

