maven - Importing an LZO file into Java Spark as a Dataset
I have some data in TSV format, compressed using LZO. Now I would like to use these data in a Java Spark program.
At the moment, I am able to decompress the files and then import them in Java as text files using
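    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("myname")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if data were imported correctly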
where I have passed the path to the decompressed file in the first argument. If I pass the LZO file as the argument instead, the result of show is illegible garbage.
Is there a way to make it work? I use IntelliJ as the IDE and the project is set up in Maven.
I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code remains the same as in the question, provided one is OK with the LZO file being imported in a single partition.
In the following I will explain how I did it for a Maven project set up in IntelliJ.
Installing the package hadoop-lzo: you need to modify the pom.xml file that is in your Maven project folder. It should contain the following excerpt:

    <repositories>
        <repository>
            <id>twitter-twttr</id>
            <url>http://maven.twttr.com</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <!-- Apache Spark main library -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <!-- Packages for Datasets and DataFrames -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>
    </dependencies>
This will activate the Maven Twitter repository, which contains the package hadoop-lzo, and make hadoop-lzo available for the project.
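If you want to confirm that Maven really resolved the hadoop-lzo dependency before moving on, a small check along these lines may help. It is only a sketch: the class name CodecClasspathCheck and the helper isOnClasspath are made up for illustration, and it only tests whether the two codec classes shipped by hadoop-lzo are visible to the class loader. It does not test whether the codec can actually decompress anything.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reports whether the hadoop-lzo codec classes can be located. */
class CodecClasspathCheck {

    // Codec class names provided by the hadoop-lzo package.
    static final List<String> LZO_CODECS = Arrays.asList(
            "com.hadoop.compression.lzo.LzoCodec",
            "com.hadoop.compression.lzo.LzopCodec");

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, CodecClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : LZO_CODECS) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it from the same Maven module as the Spark program so that it sees the same classpath. If either class is reported as MISSING, the dependency or the repository entry in pom.xml is the first thing to look at.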
The second step is to create a core-site.xml file to tell Hadoop that you have installed a new codec. It should be placed somewhere in the program folders. I put it under src/main/resources/core-site.xml and marked the folder as a resource (right click on the folder in the IntelliJ project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

    <configuration>
        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.DefaultCodec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        <property>
            <name>io.compression.codec.lzo.class</name>
            <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
    </configuration>
And that's it! Run your program again and it should work!
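If the output is still garbage, one thing worth ruling out is that core-site.xml is not being picked up from the resources folder, or that the codec list in it is misspelled (the class names are case-sensitive, e.g. LzopCodec). The following is a minimal sketch of such a check using only the JDK's XML parser. The class CoreSiteCheck and its method propertyValues are hypothetical names introduced here, not part of Hadoop or Spark.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists the codecs declared in a core-site.xml file. */
class CoreSiteCheck {

    /** Returns the trimmed, comma-separated entries of the named property (empty list if absent). */
    static List<String> propertyValues(InputStream xml, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            if (propertyName.equals(text(prop, "name"))) {
                for (String v : text(prop, "value").split(",")) {
                    if (!v.trim().isEmpty()) {
                        values.add(v.trim());
                    }
                }
            }
        }
        return values;
    }

    /** Text content of the first child element with the given tag, or "" if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Looks for core-site.xml at the root of the classpath, where src/main/resources ends up.
        try (InputStream in = CoreSiteCheck.class.getClassLoader().getResourceAsStream("core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml is not on the classpath");
                return;
            }
            List<String> codecs = propertyValues(in, "io.compression.codecs");
            System.out.println("Declared codecs: " + codecs);
            System.out.println("LzopCodec listed: "
                    + codecs.contains("com.hadoop.compression.lzo.LzopCodec"));
        }
    }
}
```

This only inspects the file; it does not prove that Hadoop loaded it. It does, however, separate "the file is missing from the classpath" from "the file is there but the codec list is wrong", which are the two configuration mistakes this setup makes possible.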