An alternative way to use SAS and Hadoop together
The challenges for SAS in Hadoop: For analytics tasks on data stored in Hadoop, Python and R are free software and are easily installed on each data node of a Hadoop cluster. Open source frameworks for Python and R, or plain Hadoop streaming, can then bring their full strength to bear on Hadoop. In contrast, SAS is proprietary software. A company may be reluctant to buy many annually expiring licenses for a Hadoop cluster built on cheap commodity hardware, and a cluster administrator will find it technically difficult to deploy SAS on hundreds of nodes. Therefore, the traditional ETL pipeline to pull ...
PROC PLS and multicollinearity
Multicollinearity and its consequences: Multicollinearity usually poses significant challenges to a regression model fitted by either the normal equation or gradient descent. 1. Invertible SSCP for normal equation: According to the normal equation, the coefficients can be obtained by $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$. If the SSCP matrix $X^{\top}X$ turns out to be singular and non-invertible due to multicollinearity, then the coefficients are theoretically not solvable. 2. Unstable solution for gradient descent: The gradient descent algorithm uses iterative updates to minimize the residual sum of squares (RSS). For example, as the plot above shows, if there is a strong relationship between two regressors in a regression, many possible combinations of their two coefficients lie along a narrow valley, which all ...
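To make the singular-SSCP point in this excerpt concrete, here is a minimal sketch in plain Python. It assumes only two regressors and no intercept, and the helper names (`sscp`, `solve_normal_equation`) are hypothetical, chosen for illustration; it is not the PROC PLS approach the post goes on to discuss.

```python
def sscp(rows):
    """Return the sums-of-squares-and-cross-products matrix X'X."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def solve_normal_equation(rows, y):
    """Solve (X'X) b = X'y for exactly two regressors (no intercept).

    Raises ValueError when X'X is singular, which is what happens
    when one regressor is an exact multiple of the other.
    """
    m = sscp(rows)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("SSCP matrix is singular; coefficients are not solvable")
    # X'y, one entry per regressor
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(2)]
    # Closed-form solution for the 2x2 case (Cramer's rule)
    b0 = (m[1][1] * xty[0] - m[0][1] * xty[1]) / det
    b1 = (m[0][0] * xty[1] - m[1][0] * xty[0]) / det
    return [b0, b1]
```

With regressors that are not collinear the function returns the two coefficients; when the second column is exactly twice the first, the determinant of X'X is zero and the function raises instead. With near-collinear rather than exactly collinear data the determinant would be tiny but nonzero, and the solution would be numerically unstable rather than undefined.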
Use R in Hadoop by streaming
It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data sets. An aggregation example: The Hadoop version used here is Cloudera’s CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set from a training course by Udacity. Each line of the data set has six tab-separated fields: date, time, store name, item description, cost and method of payment. Only two fields, store and cost, are used to aggregate the cost by store. A typical MapReduce job contains two ...
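The aggregation described in this excerpt can be sketched as a mapper and a reducer. The post itself uses R; the sketch below is written in Python purely for illustration, and the function names and the choice to skip malformed lines are assumptions, not the author's code. It relies only on the six-field, tab-separated layout stated above and on the fact that Hadoop streaming delivers mapper output to the reducer sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'store<TAB>cost' for each well-formed six-field input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # assumption: silently skip malformed lines
        store, cost = fields[2], fields[4]
        yield f"{store}\t{cost}"


def reducer(lines):
    """Sum the cost per store; input must already be sorted by store."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for store, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(cost) for _, cost in group)
        yield f"{store}\t{total:.2f}"
```

In a streaming job each function would live in its own script that reads `sys.stdin` and prints each yielded line. Locally, the shuffle step can be imitated by sorting the mapper output before passing it to the reducer.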
Kernel selection in PROC SVM
The support vector machine (SVM) is a flexible classification or regression method thanks to its many kernels. To apply an SVM, we typically need to specify a kernel, a regularization parameter c and some kernel parameters such as gamma. Following the selection of the regularization parameter c in my previous post, the SVM procedure and the iris flower data set are used here to discuss kernel selection in SAS. Exploration of the iris flower data: The iris data set is a classic for classification exercises. If we use the first two components from Principal Component Analysis (PCA) to compress the four predictors, petal length, petal width, sepal length, ...
Top 10 most powerful functions for PROC SQL
ABSTRACT: PROC SQL is not only one of the many SAS procedures but also a distinctive subsystem with all the common features of SQL (Structured Query Language). Equipped with PROC SQL, SAS becomes a full-fledged relational database management system. PROC SQL provides alternatives to the traditional DATA step and SAS procedures for managing data. In addition, SAS’s built-in functions are add-on tools that increase the power of PROC SQL. In this paper, we illustrate ten popular SAS functions that extend the capacity of PROC SQL in data management and descriptive statistics. INTRODUCTION: Structured Query Language (SQL) is a universal computer ...