Test drive for PROC HADOOP and Pig
PROC HADOOP has been available since SAS 9.3M2 and bridges a Windows client and a Hadoop server. The great thing about this procedure is that it supports user-defined functions. Applying it takes several steps.
Download Java SE and Eclipse on Windows
Java SE and Eclipse are free to download, and installation is fairly easy.
Make a user-defined function on Windows
The most basic user-defined function is an upper-case function for a string that wraps Java’s native str.toUpperCase() method. Pig’s manual has a [detailed description][1] of it.
Package the function as a JAR
There is a wonderful video tutorial on YouTube. Make sure that the version of the [Pig API][2] with ...
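The upper-case function described above is written in Java against the Pig API in the post. Purely to illustrate its logic, here is a minimal Python sketch; the name `to_upper` and the choice to pass a missing value through as `None` are assumptions made for this example, not part of the original UDF.

```python
def to_upper(value):
    """Return the upper-cased string, or None for a missing value."""
    # Assumption: a missing field is passed through unchanged as None.
    if value is None:
        return None
    # str.upper() plays the role of Java's str.toUpperCase() here.
    return value.upper()


if __name__ == "__main__":
    print(to_upper("hadoop"))
```

A real Pig UDF would additionally have to follow the Pig API's calling conventions, which this sketch does not attempt to reproduce.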
An alternative way to use SAS and Hadoop together
The challenges for SAS in Hadoop
For analytics tasks on data stored in Hadoop, Python and R are free software and are easily installed on each data node of a Hadoop cluster. Open source frameworks for Python and R, or plain Hadoop streaming, can then use their full strength on Hadoop. In contrast, SAS is proprietary software. A company may be reluctant to buy many annually expiring licenses for a Hadoop cluster built on cheap commodity hardware, and a cluster administrator will find it technically difficult to deploy SAS on hundreds of nodes. Therefore, the traditional ETL pipeline to pull ...
PROC PLS and multicollinearity
Multicollinearity and its consequences
Multicollinearity usually brings significant challenges to a regression model fitted by either the normal equation or gradient descent.
1. Invertible SSCP for the normal equation
According to the normal equation, the coefficients can be obtained by $\hat{\beta} = (X^{T}X)^{-1}X^{T}y$. If the SSCP matrix $X^{T}X$ turns out to be singular and non-invertible due to multicollinearity, then the coefficients are theoretically not solvable.
2. Unstable solution for gradient descent
The gradient descent algorithm uses iterative methods to minimize the residual sum of squares (RSS). For example, as the plot above shows, if there is a strong relationship between two regressors in a regression, many possible combinations of their two coefficients lie along a narrow valley, which all ...
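The singular SSCP case from point 1 can be checked by hand on a tiny example. The sketch below is only an illustration under stated assumptions: the data are made up, there is no intercept column, and the helper names `sscp` and `det2` are hypothetical. It builds the SSCP matrix for two regressors and shows that its determinant is zero when one regressor is an exact multiple of the other.

```python
def sscp(columns):
    """Return the SSCP matrix X'X for regressors given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical data: x2 is exactly twice x1, i.e. perfect collinearity.
x1 = [1, 2, 3, 4]
x2 = [2 * v for v in x1]
# x3 is not a multiple of x1, so the pair (x1, x3) is not collinear.
x3 = [1, 0, 2, 1]

collinear = sscp([x1, x2])
independent = sscp([x1, x3])

if __name__ == "__main__":
    print(collinear, det2(collinear))
    print(independent, det2(independent))
```

A zero determinant means the matrix has no inverse, so the normal equation cannot be solved for unique coefficients; with strong but imperfect collinearity the determinant is merely close to zero and the solution becomes numerically unstable.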
Use R in Hadoop by streaming
It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data sets.
An aggregation example
The Hadoop version used here is Cloudera’s CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set from a training course by Udacity. Each line of the data set has the format: date, time, store name, item description, cost and method of payment. The six fields are separated by tabs. Only two fields, store and cost, are used to aggregate the cost by store.
A typical MapReduce job contains two ...
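The aggregation described above, summing cost by store from six tab-separated fields, can be sketched as a map step and a reduce step. The post itself uses R; Python is used here only for illustration. The function names, the sample records and the rule of skipping lines without exactly six fields are all assumptions of this sketch. In an actual streaming job the two steps would be separate scripts that read standard input and write standard output, whereas here they are chained in one process.

```python
def mapper(lines):
    """Yield (store, cost) pairs from tab-separated sales records."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: skip malformed records without exactly six fields.
        if len(fields) != 6:
            continue
        _date, _time, store, _item, cost, _payment = fields
        yield store, float(cost)


def reducer(pairs):
    """Sum the cost for each store."""
    totals = {}
    for store, cost in pairs:
        totals[store] = totals.get(store, 0.0) + cost
    return totals


# Made-up sample records, for illustration only.
sample = [
    "2012-01-01\t09:00\tSan Jose\tMusic\t10.50\tVisa\n",
    "2012-01-01\t09:05\tFort Worth\tToys\t20.00\tCash\n",
    "2012-01-01\t09:10\tSan Jose\tBooks\t4.50\tAmex\n",
    "a malformed line\n",
]

if __name__ == "__main__":
    for store, total in sorted(reducer(mapper(sample)).items()):
        print(f"{store}\t{total:.2f}")
```

Keeping only the store and cost fields in the map step keeps the data passed to the reduce step as small as possible.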
Kernel selection in PROC SVM
The support vector machine (SVM) is a flexible classification or regression method thanks to its many kernels. To apply an SVM, we may need to specify a kernel, a regularization parameter c and some kernel parameters such as gamma. In addition to the selection of the regularization parameter c covered in my previous post, the SVM procedure and the iris flower data set are used here to discuss kernel selection in SAS.
Exploration of the iris flower data
The iris data is a classic for classification exercises. If we use the first two components from Principal Component Analysis (PCA) to compress the four predictors, petal length, petal width, sepal length, ...
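To make the role of a kernel and of gamma more concrete, here is a minimal Python sketch of two textbook kernels. It is an illustration only: the formulas are the standard linear and RBF definitions, the function names are hypothetical, and the SAS procedure may parameterize its kernels differently.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two points."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    # Two made-up points in a two-dimensional feature space.
    p, q = [0.0, 0.0], [1.0, 1.0]
    print(linear_kernel(p, q))
    for gamma in (0.5, 2.0):
        print(gamma, rbf_kernel(p, q, gamma))
```

With the RBF kernel, a larger gamma makes the similarity between two points fall off faster with distance, which is why gamma has to be chosen together with the kernel.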