Improving write performance in Hive

visakh

I am performing various calculations (using UDFs) in Hive. The computations are fast enough, but I am hitting a roadblock with write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.

Indexes are not possible since I am using Shark. It would be great to hear suggestions from the SO community on methods I could try to improve the write performance.

Thanks, TM

aaronman

I don't really use Shark since it is deprecated, but I believe it can read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
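As a quick usage note (same Spark 1.0-era API; the table name is just illustrative, and I'm assuming the Person class from the docs' previous example has name and age fields), the loaded SchemaRDD can be registered as a table and queried with SQL:

// Register the Parquet-backed SchemaRDD as a table and query it with SQL.
// Because Parquet is columnar, selecting only `name` avoids deserializing
// the other columns on disk.
parquetFile.registerAsTable("people")
val names = sqlContext.sql("SELECT name FROM people WHERE age >= 13")
names.collect().foreach(println)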

Basically, Parquet is your best bet for improving I/O speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). Because Parquet is a columnar format, a query over a wide table only has to deserialize the columns it actually needs rather than whole rows. On top of that, deserialization may be faster than with row-oriented storage, since keeping values of the same type next to each other allows better compression. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL, since Shark is no longer supported, and I don't believe there is much difference in terms of syntax.
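For your specific case, here is a minimal sketch of the write side (Spark 1.0-era API; the Result class, the toy input, and the output path are all made up for illustration, and `sc` is assumed to be an existing SparkContext): run the UDF-style computation as an ordinary RDD transformation, then persist the ten-million-row result set directly as a Parquet file instead of INSERTing into a Hive table.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD conversion

case class Result(id: Long, score: Double)

// Toy input standing in for the real UDF computation (names are illustrative).
val inputRdd = sc.parallelize(1L to 10000000L)
val results: RDD[Result] = inputRdd.map(id => Result(id, math.sqrt(id)))

// Write the ~10M rows straight to Parquet instead of going through a Hive INSERT.
results.saveAsParquetFile("results.parquet")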
