SAS vs. Python for data analysis
To perform data analysis efficiently, I need a full-stack programming language rather than frequently switching from one language to another. That means the language can hold large quantities of data, manipulate data promptly and easily (e.g. if-then-else; iteration), connect to various data sources such as relational databases and Hadoop, apply some statistical models, and report results as graphs, tables, or web pages. SAS is famous for its capacity to cover such a data cycle, as long as you are willing to pay the annual license fee. SAS's long-standing competitor, R, still keeps growing. However, in the past years, the Python ...
Use Python to normalize database
On many occasions, data needs to be normalized before entering a database in order to speed up query operations. For large text files, Python is a good fit, given its excellent row-wise data manipulation ability. My first thought is to use a nested list to hold all the data, as in the code below.

    import csv, sqlite3
    infile = open('mtcars.csv', 'r')
    f = csv.reader(infile)
    # Read the header row; next(f) works in Python 2.6+ and Python 3
    header = next(f)
    # Drop the first header cell, which labels the name column
    header.pop(0)
    data = []
    for r in f:
        # The first field of each row is the record name
        name = r.pop(0)
        # Emit one [name, variable, value] triple per remaining column
        for i in range(0, len(r)):
            data.append([name, header[i], r[i]])

However, a dictionary will be much more convenient given its built-in iteration tools.

    import csv, sqlite3
    infile = open('mtcars.csv', 'r')
    f = csv.DictReader(infile)
    data = []
    for r ...
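As a rough illustration of the dictionary-based idea in this excerpt (a sketch, not the author's code, which is cut off above), the wide-to-long conversion and the database load could look like the following. The two-row sample, the table name `cars`, and the helper names `wide_to_long` and `load` are assumptions made for the example; an in-memory SQLite database stands in for the real target.

```python
import csv
import io
import sqlite3

# Hypothetical two-row sample standing in for mtcars.csv. The first header
# cell is assumed to be empty and to label the name column.
SAMPLE = '''"","mpg","cyl"
"Mazda RX4",21,6
"Datsun 710",22.8,4
'''


def wide_to_long(handle):
    """Turn each wide row into (name, variable, value) triples."""
    reader = csv.DictReader(handle)
    name_col = reader.fieldnames[0]  # first column holds the record name
    rows = []
    for record in reader:
        name = record[name_col]
        for key in reader.fieldnames[1:]:
            rows.append((name, key, float(record[key])))
    return rows


def load(rows):
    """Insert the triples into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (name TEXT, variable TEXT, value REAL)")
    conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


rows = wide_to_long(io.StringIO(SAMPLE))
conn = load(rows)
```

Iterating over `reader.fieldnames` rather than over the row dictionary keeps the column order explicit, which does not rely on dictionary ordering in older Python versions.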
10 popular Linux commands for Hadoop
The Hadoop system has its own shell language, which is called FS. Compared with the common Bash shell within the Linux ecosystem, the FS shell has far fewer commands. To deal with the huge volume of data stored in a distributed fashion across the Hadoop nodes, I rely on 10 popular Linux commands in my daily work.
1. sort
A good practice when running Hadoop is to always test the map/reduce programs on the local machine before releasing the time-consuming map/reduce code to the cluster environment. The sort command simulates the sort and shuffle step necessary for the map/reduce process. For example, I can run ...
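To make the local-testing idea concrete, here is a minimal Python sketch of the same flow: a map step, a sort that plays the role the excerpt assigns to the `sort` command, and a reduce step over the grouped keys. The word-count task and the names `mapper`, `reducer`, and `run_local` are assumptions chosen for illustration, not the programs from the original post.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts per key; the input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Run map, sort, and reduce in one process on a small sample."""
    mapped = list(mapper(lines))
    # Sorting by key stands in for the sort and shuffle step on the cluster
    shuffled = sorted(mapped, key=itemgetter(0))
    return list(reducer(shuffled))


result = run_local(["hadoop sort", "sort shuffle sort"])
```

The reducer depends on its input being sorted by key, which is why the sort step has to sit between the mapper and the reducer in a local dry run as well.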
Top 10 tips and tricks about PROC SQL
INTRODUCTION
PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been very popular among SAS users. SAS ships with a few sample data sets in its SASHELP library, and SASHELP.CLASS is one of them. This data set contains 5 variables (name, weight, height, sex and age) for 19 simulated teenagers, and in this paper I primarily use it for demonstration purposes. Here I summarize 10 interesting tips and tricks for PROC SQL. To begin, I first make a copy of SASHELP.CLASS in the WORK library and ...
Sortable tables in SAS
This is an update of my previous post Make all SAS tables sortable in the output HTML. Previously I manually added the sortable plugin to the SAS output. With the PREHTML attribute in PROC TEMPLATE, the sortable HTML template can now be saved automatically for future use.

    /* 0 -- Create the sortable HTML template */
    proc template;
       define style sortable;
          parent=styles.htmlblue;
          style body from body /
             prehtml='
             <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
             <script src="http://cdn.jsdelivr.net/tablesorter/2.0.5b/jquery.tablesorter.min.js"></script>
             <script>
             $(document).ready(function( ) {
                $(".table").tablesorter({widgets: ["zebra"]});
             });
             </script>
             ';
       end;
    run;

    /* 1 -- Make all the tables sortable */
    ods html file = 'tmp.html' style = sortable;
    proc reg data=sashelp.class;
       model weight = height age;
    run;
    proc print data=sashelp.class;
    run;

While we ...