Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets that reside in distributed storage and are queried using SQL syntax. Built on top of Apache Hadoop, Hive provides easy access to data via SQL, enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Project concept
Apache Hive supports many built-in functions to manipulate and process data. Although many options are available, a built-in function that fits a particular business use case is sometimes missing. Hive lets you fill that gap by writing user-defined functions (UDFs), which extend the org.apache.hadoop.hive.ql.exec.UDF class.
The idea is to complement the built-in functions available in Apache Hive with new ones that can be added on. In this project, we will take some of the workaround solutions used in Hive for certain business use cases and replace them with custom Hive UDFs.
Custom UDF List
In the first version of this project, we are releasing two custom UDFs.
UDF-1.0: Find total occurrences of a word/character in a sentence
This custom UDF counts the total number of times a given word or character occurs in a sentence. It is particularly useful when you need to quickly count matches in a sentence or in a database column.
For example, if you want the number of # (hashtag) characters in a Hive column of tweets, you can use this function to get the total hashtag count.
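To make the counting idea concrete, here is a minimal sketch of the core logic in plain Java. It is not the project's actual implementation: the class name OccurrenceCounter and the method countOccurrences are hypothetical, and the choices to count non-overlapping substring matches and to return 0 for null or empty input are assumptions. In a real Hive UDF, this logic would sit inside the evaluate method of a class extending org.apache.hadoop.hive.ql.exec.UDF.

```java
// Hypothetical helper for illustration only; not the project's UDF source.
// In a Hive UDF this logic would be called from evaluate(...) in a class
// that extends org.apache.hadoop.hive.ql.exec.UDF.
final class OccurrenceCounter {

    private OccurrenceCounter() {}

    // Counts non-overlapping occurrences of token in text.
    // Assumed convention: null or empty input yields 0.
    static int countOccurrences(String text, String token) {
        if (text == null || token == null || token.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while (true) {
            int idx = text.indexOf(token, from);
            if (idx < 0) {
                break;
            }
            count++;
            // Skip past this match so matches never overlap.
            from = idx + token.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String tweet = "Loving #hive and #hadoop for #bigdata";
        System.out.println(countOccurrences(tweet, "#")); // prints 3
    }
}
```

Note that this sketch matches substrings, so searching for "data" would also count the match inside "bigdata". Whole-word matching would need extra boundary checks.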
Documentation and Usage link for UDF1.0
UDF-2.0: Find total days minus the weekends between two dates
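As a rough sketch of what such a function might compute, the plain Java below counts the days between two dates while skipping weekends. The class name WeekdayCounter and the method weekdaysBetween are hypothetical, and several choices are assumptions rather than the project's documented behavior: the weekend is taken to be Saturday and Sunday, the start date is included, and the end date is excluded.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

// Hypothetical helper for illustration only; not the project's UDF source.
final class WeekdayCounter {

    private WeekdayCounter() {}

    // Counts the days in [start, end) that are not Saturday or Sunday.
    // Assumed convention: null input or start on/after end yields 0.
    static long weekdaysBetween(LocalDate start, LocalDate end) {
        if (start == null || end == null || !start.isBefore(end)) {
            return 0;
        }
        long count = 0;
        for (LocalDate d = start; d.isBefore(end); d = d.plusDays(1)) {
            DayOfWeek dow = d.getDayOfWeek();
            if (dow != DayOfWeek.SATURDAY && dow != DayOfWeek.SUNDAY) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 2024-01-01 is a Monday, so this range covers one full week.
        LocalDate start = LocalDate.of(2024, 1, 1);
        LocalDate end = LocalDate.of(2024, 1, 8);
        System.out.println(weekdaysBetween(start, end)); // prints 5
    }
}
```

The day-by-day loop is easy to verify but runs in time proportional to the length of the range. For very long ranges, a closed-form calculation based on whole weeks would be faster.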

