maven - Importing an LZO file into Java Spark as a Dataset
I have data in TSV format compressed using LZO. Now, I would like to use these data in a Java Spark program.
At the moment, I am able to decompress the files and then import them in Java as text files using
    // uses org.apache.spark.sql.SparkSession, Dataset and Row
    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("myname")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if data were imported correctly

where I have passed the path to the decompressed file in the first argument. If I pass the LZO file as the argument instead, the result of show is illegible garbage.
Is there a way to make it work? I use IntelliJ as an IDE and the project is set up in Maven.
I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code can remain the same as in the question, provided one is OK with the LZO file being imported in a single partition.
In the following I explain how I did it for a Maven project set up in IntelliJ.
Installing the package hadoop-lzo: you need to modify the pom.xml file in your Maven project folder. It should contain the following excerpt:

    <repositories>
        <repository>
            <id>twitter-twttr</id>
            <url>http://maven.twttr.com</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <!-- Apache Spark main library -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <!-- Packages for Datasets and DataFrames -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>
    </dependencies>
This activates the Twitter Maven repository, which contains the package hadoop-lzo, and makes hadoop-lzo available to the project.
The second step is to create a core-site.xml file to tell Hadoop that you have installed a new codec. It should be placed somewhere in the program folders. I put it under src/main/resources/core-site.xml and marked the folder as a resource (right click on the folder in the IntelliJ project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

    <configuration>
        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.DefaultCodec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        <property>
            <name>io.compression.codec.lzo.class</name>
            <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
    </configuration>
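Since everything depends on core-site.xml actually ending up on the classpath, a quick check can help when the output is still garbage. The following is only a sketch using the JDK's XML parser: the class name CoreSiteCodecs and its methods are hypothetical helpers, not part of Hadoop or Spark, and it assumes the file is packaged at the classpath root (as it is when placed directly under src/main/resources).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: lists the codecs declared in a core-site.xml stream. */
final class CoreSiteCodecs {

    /** Returns the trimmed class names found under io.compression.codecs. */
    static List<String> declaredCodecs(InputStream coreSite) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(coreSite);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse core-site.xml", e);
        }
        List<String> codecs = new ArrayList<>();
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            if ("io.compression.codecs".equals(childText(property, "name"))) {
                // The value is a comma-separated list of codec class names.
                for (String codec : childText(property, "value").split(",")) {
                    if (!codec.trim().isEmpty()) {
                        codecs.add(codec.trim());
                    }
                }
            }
        }
        return codecs;
    }

    /** Text of the first child element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: core-site.xml sits at the classpath root.
        try (InputStream in = CoreSiteCodecs.class.getResourceAsStream("/core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml is not on the classpath");
                return;
            }
            System.out.println(declaredCodecs(in));
        }
    }
}
```

Running main from the same module as the Spark program should print the codec list. If it reports that the file is missing, the resources folder is probably not marked as a resources root.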
And that's it! Run your program again and it should work!
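If show still prints garbage after both steps, one thing worth ruling out is that the input is not in the lzop container format at all. The sketch below is not part of the original solution: it assumes the file was produced by the lzop tool, whose files begin with a fixed 9-byte magic sequence, and the class name LzopMagic is a hypothetical helper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: checks whether a file starts with the lzop magic bytes. */
final class LzopMagic {

    // Magic sequence written at the start of every lzop file.
    private static final byte[] MAGIC = {
        (byte) 0x89, 'L', 'Z', 'O', 0x00, 0x0d, 0x0a, 0x1a, 0x0a
    };

    /** True if the given leading bytes begin with the lzop magic sequence. */
    static boolean hasLzopHeader(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        System.out.println(hasLzopHeader(Arrays.copyOf(head, filled)));
    }
}
```

Pass the same path that is given to the Spark program as the first argument. A result of false suggests the file is either already decompressed or compressed with something other than lzop.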