maven - Importing an LZO file into Java Spark as a Dataset

June 15, 2011

I have data in TSV format, compressed using LZO. Now I would like to use this data in a Java Spark program.

At the moment, I am able to decompress the files and then import them in Java as text files using

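    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Inside main(String[] args):
    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("myname")
            .getOrCreate();

    // Read the tab-separated file whose path is given as the first argument
    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check whether the data were imported correctly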

where I have passed the path to the decompressed file as the first argument. If I pass the LZO file as the argument instead, the result of show is illegible garbage.
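To confirm that the garbage really is raw compressed data being read as text, one can look at the first bytes of the input. The sketch below assumes the file was written by the lzop tool, whose files start with a fixed 9-byte magic number; the class and method names are made up for illustration and are not part of Spark or hadoop-lzo.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class LzopMagicCheck {
    // Magic number at the start of a file written by the lzop tool
    // (assumption: the .lzo input was produced by lzop).
    private static final byte[] LZOP_MAGIC = {
        (byte) 0x89, 0x4c, 0x5a, 0x4f, 0x00, 0x0d, 0x0a, 0x1a, 0x0a
    };

    /** Returns true if the given leading bytes begin with the lzop magic number. */
    public static boolean startsWithLzopMagic(byte[] head) {
        return head.length >= LZOP_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, LZOP_MAGIC.length), LZOP_MAGIC);
    }

    /** Reads the first bytes of a file and compares them with the magic number. */
    public static boolean isLzopFile(Path path) throws IOException {
        byte[] head = new byte[LZOP_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && startsWithLzopMagic(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(args[0] + " looks like lzop data: " + isLzopFile(Paths.get(args[0])));
        }
    }
}
```

If this reports true for the path passed to the Spark program, the reader is being handed compressed bytes and needs a codec rather than a different separator or encoding.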

Is there a way to make this work? I use IntelliJ as my IDE, and the project is set up in Maven.

I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code remains the same as in the question, provided one is OK with the LZO file being imported in a single partition.

In the following I explain how to do this for a Maven project set up in IntelliJ.

Installing the hadoop-lzo package: you need to modify the pom.xml file in your Maven project folder. It should contain the following excerpt:

    <repositories>
        <repository>
            <id>twitter-twttr</id>
            <url>http://maven.twttr.com</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>

        <dependency>
            <!-- Apache Spark main library -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <!-- Packages for Datasets and DataFrames -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>

    </dependencies>

This activates the Twitter Maven repository, which contains the hadoop-lzo package, and makes hadoop-lzo available to the project.
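As a quick sanity check that Maven actually resolved the dependency, one can ask the class loader for the codec classes by name. This is only a sketch: the class `CodecClasspathCheck` and its method are hypothetical helpers, and the two codec class names are the ones hadoop-lzo is expected to provide. The classes are looked up without being initialized, so the check does not exercise the codecs themselves.

```java
public class CodecClasspathCheck {

    /** Returns true if a class with the given name can be found, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodecClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Codec classes expected from the hadoop-lzo artifact
        String[] codecClasses = {
            "com.hadoop.compression.lzo.LzoCodec",
            "com.hadoop.compression.lzo.LzopCodec"
        };
        for (String name : codecClasses) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running this from the same IntelliJ module as the Spark program shows whether the project classpath contains the codec classes before any Hadoop configuration is involved.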

The second step is to create a core-site.xml file to tell Hadoop that you have installed a new codec. It should be placed somewhere in the program folders. I put it under src/main/resources/core-site.xml and marked the folder as a resource (right click on the folder in the IntelliJ project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

    <configuration>
        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.DefaultCodec,
                com.hadoop.compression.lzo.LzoCodec,
                com.hadoop.compression.lzo.LzopCodec,
                org.apache.hadoop.io.compress.GzipCodec,
                org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        <property>
            <name>io.compression.codec.lzo.class</name>
            <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
    </configuration>
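If the program still prints garbage, it may be worth checking that core-site.xml is really visible on the classpath and that the codec list reads as intended. The following is a minimal sketch using only the JDK's XML parser; `CoreSiteCheck` and `propertyValues` are illustrative names, and the comma splitting and trimming here are this sketch's own reading of the file, not a statement about how Hadoop parses it.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CoreSiteCheck {

    /** Returns the trimmed, comma-separated entries of the named property (empty if absent). */
    public static List<String> propertyValues(InputStream xml, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList properties = doc.getElementsByTagName("property");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList vals = property.getElementsByTagName("value");
            if (names.getLength() == 0 || vals.getLength() == 0) {
                continue; // skip malformed <property> entries
            }
            if (!names.item(0).getTextContent().trim().equals(propertyName)) {
                continue;
            }
            for (String part : vals.item(0).getTextContent().split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    values.add(trimmed);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Resources under src/main/resources end up at the root of the classpath
        try (InputStream in = CoreSiteCheck.class.getClassLoader().getResourceAsStream("core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml is not on the classpath");
                return;
            }
            for (String codec : propertyValues(in, "io.compression.codecs")) {
                System.out.println(codec);
            }
        }
    }
}
```

Each printed line should be one fully qualified codec class name, spelled with the exact capitalization shown in the file above.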

And that's it! Run your program again and it should work!
