Deploy edX Spark environment to DigitalOcean
This summer I took the Spark courses CS100 and CS190 on edX and had a wonderful experience. Both classes use a Vagrant virtual machine containing Spark and all the teaching materials. There are two challenges with the virtual machine: (1) The labs usually take a long time to finish, say 8-10 hours. If the host machine is shut down, the RDDs are lost and the pipeline has to be run again. (2) Some RDD operations, such as groupByKey and distinct, take a lot of computation and communication power. Many of my 50k classmates complained about the waiting time. And my most used laptop is a Chromebook and doesn’t ...
Transform SAS files to Parquet through Spark
The demo pipeline is at GitHub. Since version 1.3, Spark has offered a new data structure, the DataFrame. A data analyst can now easily scale out existing DataFrame-based code from Python or R to a cluster hosting Hadoop and Spark. There are quite a few practical scenarios where the DataFrame fits well. For example, many data files, including hard-to-read SAS files, need to be merged into a single data store. Apache Parquet is a popular columnar store for distributed environments, and it is especially friendly to structured or semi-structured data. It is an ideal candidate for a ...
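As a rough illustration of the SAS-to-Parquet idea above, here is a minimal sketch. It is not the demo pipeline itself: it assumes the modern `SparkSession` API rather than the Spark 1.3 one, reads each file on the driver with `pandas.read_sas` (so every file must fit in driver memory), and uses hypothetical directory names `sas_data` and `parquet_store`. The `latin-1` encoding is also an assumption about the input files.

```python
from pathlib import Path


def parquet_target(sas_path, out_root):
    """Map e.g. data/class.sas7bdat to <out_root>/class.parquet."""
    return Path(out_root) / (Path(sas_path).stem + ".parquet")


def sas_to_parquet(sas_path, out_root, spark):
    """Read one SAS file on the driver and write it out as Parquet via Spark."""
    # Imported lazily so that parquet_target() stays usable without pandas.
    import pandas as pd

    # Assumption: the file is a sas7bdat file with latin-1 encoded text columns.
    pdf = pd.read_sas(str(sas_path), format="sas7bdat", encoding="latin-1")
    sdf = spark.createDataFrame(pdf)
    target = parquet_target(sas_path, out_root)
    sdf.write.mode("overwrite").parquet(str(target))
    return target


def main(sas_dir="sas_data", out_root="parquet_store"):
    """Convert every *.sas7bdat file under sas_dir (hypothetical paths)."""
    # Imported lazily so the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sas-to-parquet").getOrCreate()
    try:
        for sas_file in sorted(Path(sas_dir).glob("*.sas7bdat")):
            print(sas_to_parquet(sas_file, out_root, spark))
    finally:
        spark.stop()
```

Calling `main()` on a machine with pandas and PySpark installed would write one Parquet directory per SAS file. For files too large for the driver, a distributed SAS reader would be needed instead of the pandas step.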
Two alternative ways to query large datasets in SAS
I really appreciate the wonderful comments that readers left on my SAS posts (1, 2, 3). They gave me a lot of inspiration. Because of the inherent limitations of SAS and SQL, I have recently found it difficult to deal with some extremely large SAS datasets (meaning that I had exhausted all the traditional approaches). Here I summarize two alternative solutions for these extreme cases as a follow-up to the comments. Read directly: use a scripting language such as Python to read SAS datasets directly. Code generator: use SAS or another scripting language to generate SAS/SQL code. The examples still use sashelp.class, which has 19 rows. The target variable is weight.
*In SAS
data class; ...
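To make the code-generator idea concrete, here is a minimal sketch in Python that writes one PROC SQL step per chunk of rows, so that a large table can be queried piece by piece. The chunking scheme, the `part1`, `part2` output table names, and the `sum(weight)` query are all hypothetical choices for illustration, not the post's actual example; the only SAS features relied on are the `firstobs=` and `obs=` data set options.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield 1-based (firstobs, obs) pairs that together cover rows 1..n_rows."""
    for start in range(1, n_rows + 1, chunk_size):
        yield start, min(start + chunk_size - 1, n_rows)


def generate_sql(table, column, n_rows, chunk_size):
    """Return SAS code containing one PROC SQL step per chunk of the table."""
    steps = []
    for i, (first, last) in enumerate(chunk_bounds(n_rows, chunk_size), start=1):
        steps.append(
            "proc sql;\n"
            f"    create table part{i} as\n"
            f"    select sum({column}) as total_{column}\n"
            # obs= is the number of the last row to read, not a row count
            f"    from {table}(firstobs={first} obs={last});\n"
            "quit;"
        )
    return "\n\n".join(steps)


# sashelp.class has 19 rows, so a chunk size of 10 gives two steps.
sas_code = generate_sql("sashelp.class", "weight", n_rows=19, chunk_size=10)
print(sas_code)
```

The generated text can be saved to a `.sas` file and submitted to SAS. The partial results would still have to be combined in a final step.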
A cheat sheet for linear regression validation
The link to the cheat sheet is here. I have benefited a lot from the UCLA SAS tutorial, especially the chapter on regression diagnostics. However, the content on the webpage seems to be outdated. The great thing about PROC REG is that it creates a beautiful and concise 3x3 plot panel for residual analysis. I created a cheat sheet to help interpret the diagnostics panel. Thanks to the ESS project, the BASEBALL data set used by SAS is publicly available. I borrowed this data set as an example, and the cheat sheet also contains the data and the SAS program. The regression model attempts ...
saslib: a simple Python tool to look up SAS metadata
saslib is an HTML report generator for looking up metadata (the header information), like PROC CONTENTS in SAS. It reads sas7bdat files directly and quickly, and does not need SAS installed. It emulates PROC CONTENTS with jQuery and DataTables. It extracts the metadata from all sas7bdat files under the specified directory. It supports IE (>=10), Firefox, Chrome and any other modern browser.
Installation
    pip install saslib
saslib requires sas7bdat and jinja2.
Usage
The module is very simple to use. For example, the SAS data sets under the SASHELP library can be viewed as follows:
    from saslib import PROCcontents
    sasdata = PROCcontents('c:/Program Files/SASHome/SASFoundation/9.3/core/sashelp')
    sasdata.show()
The resulting HTML file from the code above will look like the one here.