Transform SAS files to Parquet through Spark

The demo pipeline is at GitHub. Since version 1.3, Spark has introduced the DataFrame data structure. A data analyst can now easily scale out existing DataFrame-based code in Python or R to a cluster running Hadoop and Spark. There are quite a few practical scenarios where a DataFrame fits well. For example, many data files, including hard-to-read SAS files, need to be merged into a single data store. Apache Parquet is a popular columnar store for distributed environments, and it is especially friendly to structured or semi-structured data. It is an ideal candidate for a ...
Two alternative ways to query large datasets in SAS

I really appreciate the wonderful comments on my SAS posts from the readers (1, 2, 3). They gave me a lot of inspiration. Due to SAS's and SQL's inherent limitations, I have recently had difficulty dealing with some extremely large SAS datasets (meaning that I had exhausted all the traditional approaches). Here I summarize two alternative solutions for these extreme cases as a follow-up to the comments.

Read directly: use a scripting language such as Python to read the SAS datasets directly.
Code generator: use SAS or another scripting language to generate SAS/SQL code.

The examples still use sashelp.class, which has 19 rows. The target variable is weight.

* In SAS;
data class; ...
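The "code generator" idea can be sketched with plain Python string templates that emit repetitive SAS code instead of writing it by hand. The chunking scheme and dataset names below are made up for illustration; the post's actual generator may differ:

```python
def generate_sas_chunks(dataset, total_rows, chunk_size):
    """Emit one SAS DATA step per chunk using firstobs=/obs= options."""
    steps = []
    for i, start in enumerate(range(1, total_rows + 1, chunk_size)):
        end = min(start + chunk_size - 1, total_rows)
        steps.append(
            "data chunk{0};\n"
            "  set {1}(firstobs={2} obs={3});\n"
            "run;".format(i, dataset, start, end)
        )
    return "\n\n".join(steps)


# generate two DATA steps covering the 19 rows of sashelp.class
print(generate_sas_chunks("sashelp.class", 19, 10))
```

Generated code like this can then be `%include`d or pasted back into a SAS session, which sidesteps writing hundreds of near-identical steps manually.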
A cheat sheet for linear regression validation

The link to the cheat sheet is here. I have benefited a lot from the UCLA SAS tutorial, especially the chapter on regression diagnostics. However, the content on that webpage seems to be outdated. The great thing about PROC REG is that it creates a beautiful and concise 3x3 plot panel for residual analysis. I created a cheat sheet to help interpret the diagnostics panel. Thanks to the ESS project, the BASEBALL data set used by SAS is publicly available. I borrowed this data set as an example, and the cheat sheet also contains the data and the SAS program. The regression model attempts ...
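The quantities behind PROC REG's diagnostics panel (residuals, leverage, studentized residuals, Cook's distance) can be sketched in a few lines of NumPy. This is a toy illustration, not the cheat sheet's SAS program, and it assumes a design matrix `X` that already includes an intercept column:

```python
import numpy as np


def regression_diagnostics(X, y):
    """OLS fit plus the standard per-observation diagnostic measures."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    # hat (projection) matrix H = X (X'X)^-1 X'; its diagonal is the leverage
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    leverage = np.diag(H)
    mse = resid @ resid / (n - p)
    # internally studentized residuals
    student = resid / np.sqrt(mse * (1.0 - leverage))
    # Cook's distance: influence of each observation on the fit
    cooks = student ** 2 * leverage / ((1.0 - leverage) * p)
    return fitted, resid, leverage, student, cooks
```

Plotting studentized residuals against fitted values, and Cook's distance against observation number, reproduces two of the panels PROC REG draws automatically.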
saslib: a simple Python tool to look up SAS metadata

saslib is an HTML report generator that looks up metadata (the header information), like PROC CONTENTS in SAS. It reads sas7bdat files directly and quickly, and does not need SAS installed.

Emulates PROC CONTENTS with jQuery and DataTables.
Extracts the metadata from all sas7bdat files under the specified directory.
Supports IE (>= 10), Firefox, Chrome, and any other modern browser.

Installation

pip install saslib

saslib requires sas7bdat and jinja2.

Usage

The module is very simple to use. For example, the SAS data sets under the SASHELP library can be viewed as follows:

from saslib import PROCcontents

sasdata = PROCcontents('c:/Program Files/SASHome/SASFoundation/9.3/core/sashelp')
sasdata.show()

The resulting HTML file from the code above will look like here.
Deploy a minimal Spark cluster

Requirements

Since Spark is rapidly evolving, I need to deploy and maintain a minimal Spark cluster for testing and prototyping. A public cloud is the best fit for my current needs.

Intranet speed: the cluster should copy data from one server to another easily. MapReduce always shuffles large chunks of data through HDFS, so it is best if the hard disks are SSDs.

Elasticity and scalability: before scaling the cluster out to more machines, the cloud should have some elasticity to size up or size down.

Locality of Hadoop: most importantly, the Hadoop cluster and the Spark cluster should have one-to-one ...