Installing DataFu
DataFu is a collection of Apache Pig UDFs (User-Defined Functions) for statistical evaluation. They were developed by LinkedIn and are now open source under the Apache 2.0 license.
A number of usage examples and other information are available at https://github.com/linkedin/datafu.
To Use DataFu in a Parcel-deployed Cluster
If your cluster uses parcels, DataFu is already installed. Before you use it, register the JAR file with the following statement:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu.jar
To Use DataFu in a Package-deployed Cluster
- Install the DataFu package:
Operating system      Install command
Red-Hat-compatible    sudo yum install pig-udf-datafu
SLES                  sudo zypper install pig-udf-datafu
Debian or Ubuntu      sudo apt-get install pig-udf-datafu
This puts the DataFu JAR file (for example, datafu-0.0.4-cdh5.0.0.jar) in /usr/lib/pig.
- Register the JAR. Replace the <DataFu_version> and <CDH_version> placeholders with the DataFu and CDH version numbers of the JAR file installed on your system.
REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar
For example:
REGISTER /usr/lib/pig/datafu-0.0.4-cdh5.0.0.jar
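If you are not sure which version numbers apply, you can read them from the installed file name. The following is a minimal sketch, assuming the package placed the JAR in /usr/lib/pig as described above. The function name datafu_register_line is hypothetical and is not part of DataFu or CDH.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DataFu or CDH): print a Pig
# REGISTER statement for the first DataFu JAR found in a directory.
# Assumes the JAR follows the datafu-<DataFu_version>-cdh<CDH_version>.jar
# naming pattern described in this section.
datafu_register_line() {
  datafu_dir="${1:-/usr/lib/pig}"
  for datafu_jar in "$datafu_dir"/datafu-*.jar; do
    # If nothing matches, the pattern is left unexpanded and -f fails.
    if [ -f "$datafu_jar" ]; then
      printf 'REGISTER %s\n' "$datafu_jar"
      return 0
    fi
  done
  echo "No DataFu JAR found in $datafu_dir" >&2
  return 1
}

# Look in the package install location; report if the package is missing.
datafu_register_line /usr/lib/pig || echo "Install the pig-udf-datafu package first" >&2
```

Copy the printed statement into your Pig script. On a parcel-deployed cluster this lookup is not needed, because the JAR path there does not contain version numbers.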