When ROC fails logistic regression for rare-event data ROC or AUC is widely used in logistic regression or other classification methods for model comparison and feature selection, which measures the trade-off between sensitivity and specificity. The paper by Gary King warns the dangers using logistic regression for rare event and proposed a penalized likelihood estimator. In PROC LOGISTIC, the FIRTH option implements this penalty concept.When the event in the response variable is rare, the ROC curve will be dominated by minority class and thus insensitive to the change of true positive rate, which provides litter information for model diagnosis. For example, I construct a subset of SASHELP.CARS with the response variable Type including ...
Test drive for PROC HADOOP and Pig PROC HADOOP is available since SAS 9.3M2, which bridges a Windows client and a Hadoop server. The great thing about this procedure is that it supports user-defined function. There are several steps to apply this procedure.Download Java SE and Eclipse on WindowsJava SE and Eclipse are free to download. Installation is also fairly easy.Make user-defined function on WindowsThe most basic user-defined function is an upper-case function for a string that wraps Java’s native str.toUpperCase() function. Pig’s manual has [detail descripton][1] about it.Package the function as JARThere is a wonderful video tutorial on YouTube. Make sure that version of the [Pig API][2] with ...
An alternative way to use SAS and Hadoop together The challenges for SAS in HadoopFor analytics tasks on the data stored on Hadoop, Python or R are freewares and easily installed in each data node of a Hadoop cluster. Then some open source frameworks for Python and R, or the simple Hadoop streaming would utilize the full strength of them on Hadoop. On the contrary, SAS is a proprietary software. A company may be reluctant to buy many yearly-expired licenses for a Hadoop cluster that is built on cheap commodity hardwares, and a cluster administrator will feel technically difficult to implement SAS for hundreds of the nodes. Therefore, the traditional ETL pipeline to pull ...
PROC PLS and multicollinearity Multicollinearity and its consequencesMulticollinearity usually brings significant challenges to a regression model by using either normal equation or gradient descent.1. Invertible SSCP for normal equationAccording to normal equation, the coefficients could be obtained by . If the SSCP turns to be singular and non-invertible due to multicollinearity, then the coefficients are theoretically not solvable.2. Unstable solution for gradient descentThe gradient descent algorithm seeks to use iterative methods to minimize residual sum of squares (RSS). For example, as the plot above shows, if there is strong relationship between two regressors in a regression, many possible combinations of  and  lie along a narrow valley, which all ...
Use R in Hadoop by streaming It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data set.An aggregation exampleThe Hadoop version used here is Cloudera’s CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set form a training course by Udacity. Format of each line of the data set is: date, time, store name, item description, cost and method of payment. The six fields are separated by tab. Only two fields, store and cost, are used to aggregate the cost by each store.A typical MapReduce job contains two ...