Deploy a minimal Spark cluster

Requirements
Since Spark is rapidly evolving, I need to deploy and maintain a minimal Spark cluster for the purpose of testing and prototyping. A public cloud is the best fit for my current demand.

Intranet speed
The cluster should copy data from one server to another quickly. MapReduce always shuffles a large chunk of data throughout HDFS, so it is best that the hard disks are SSDs.

Elasticity and scalability
Before scaling the cluster out to more machines, the cloud should have some elasticity to size up or size down.

Locality of Hadoop
Most importantly, the Hadoop cluster and the Spark cluster should have one-to-one ...
Solve the Top N questions in SAS/SQL

This is a follow-up to my previous post about SAS/SQL. SAS's SQL procedure supports only basic SQL syntax. I found that the most challenging work is to use PROC SQL to solve the Top N (or Top N by Group) questions. Compared with other modern database systems, PROC SQL lacks:

- The ranking functions such as RANK(), or the SELECT TOP clause such as

```sql
select TOP 3 * from class;
```

- The PARTITION BY clause such as

```sql
select sex, name, weight
from (select sex, name, weight,
             max(weight) over(partition by sex) as max_weight
      from class)
where weight = max_weight;
```

However, there are always some alternative solutions in ...
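One alternative PROC SQL does accept is a correlated subquery that counts, for each row, how many rows in the same group rank above it. The sketch below uses the `class` table and the Top 3 cutoff from the examples above; note it is only an illustration and will return more than three rows per group when weights tie.

```sql
/* Top 3 weights within each sex, without RANK() or PARTITION BY */
proc sql;
  select sex, name, weight
  from class as a
  where (select count(*)
         from class as b
         where b.sex = a.sex and b.weight > a.weight) < 3
  order by sex, weight desc;
quit;
```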
Deploy a MongoDB powered Flask app in 5 minutes

This is a quick tutorial to deploy a web service (a social network) with an LNMP (Linux, Nginx, MongoDB, Python) infrastructure on any IaaS cloud. The repo on GitHub is at https://github.com/dapangmao/minitwit-mongo-ubuntu.

Stack
The stack is built on the tools in the ecosystem of Python below.

| Tool | Name | Advantage |
|---|---|---|
| Cloud | DigitalOcean | Cheap but fast |
| Server distro | Ubuntu 14.10 x64 | Everything is latest |
| WSGI proxy | Gunicorn | Manages workers automatically |
| Web proxy | Nginx | Fast and easy to configure |
| Framework | Flask | Single-file approach for MVC |
| Data store | MongoDB | No schema needed and scalable |
| DevOps | Fabric | Agentless and Pythonic |

In addition, a Supervisor running on the server provides a daemon to protect the Gunicorn-Flask process.

The MiniTwit app
The MiniTwit application is an example provided by Flask, which is a ...
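A Supervisor program entry for the Gunicorn-Flask process could look like the following sketch. The program name, paths, port, and WSGI entry point (`minitwit:app`) are assumptions for illustration, not the repo's actual configuration.

```ini
; /etc/supervisor/conf.d/minitwit.conf  (hypothetical paths)
[program:minitwit]
command=/usr/local/bin/gunicorn -w 4 -b 127.0.0.1:8000 minitwit:app
directory=/var/www/minitwit
autostart=true
autorestart=true
stderr_logfile=/var/log/minitwit.err.log
```

With this in place, Supervisor restarts Gunicorn automatically if the process dies, which is the "daemon to protect" role mentioned above.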
Spark practice (4): malicious web attack

Suppose there is a website tracking user activities to prevent robotic attacks on the Internet. Please design an algorithm to identify user IDs that have more than 500 clicks within any given 10 minutes.

Sample.txt:

```
anonymousUserID timeStamp clickCount
123 9:45am 10
234 9:46am 12
234 9:50am 20
456 9:53am 100
123 9:55am 33
456 9:56am 312
123 10:03am 110
123 10:16am 312
234 10:20am 201
456 10:23am 180
123 10:25am 393
456 10:27am 112
999 12:21pm 888
```

Thought
This is a typical example of stream processing. The key is to build a fixed-length window that slides through all the data, count the data within it, and return the possibly malicious IDs.

Single machine solution
Two data structures are used: a queue ...
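Since the excerpt cuts off at "a queue ...", here is a minimal single-machine sketch of the sliding-window idea, assuming the input is sorted by time. It keeps, per user, a deque of recent events and a running total for the current 10-minute window; the function and variable names are mine, not necessarily the post's.

```python
from collections import defaultdict, deque

WINDOW = 10      # window length in minutes
THRESHOLD = 500  # clicks allowed inside one window

def to_minutes(ts):
    """Convert a timestamp like '9:45am' or '12:21pm' to minutes since midnight."""
    hour, minute = map(int, ts[:-2].split(':'))
    hour = hour % 12 + (12 if ts[-2:] == 'pm' else 0)
    return hour * 60 + minute

def malicious_ids(records):
    """records: (user_id, timestamp, clicks) tuples sorted by time.
    Returns the set of IDs whose clicks exceed THRESHOLD within any
    WINDOW-minute span."""
    queues = defaultdict(deque)   # user -> deque of (minute, clicks)
    totals = defaultdict(int)     # user -> clicks inside current window
    flagged = set()
    for uid, ts, clicks in records:
        t = to_minutes(ts)
        q = queues[uid]
        q.append((t, clicks))
        totals[uid] += clicks
        while t - q[0][0] > WINDOW:   # evict events older than the window
            _, old_clicks = q.popleft()
            totals[uid] -= old_clicks
        if totals[uid] > THRESHOLD:
            flagged.add(uid)
    return flagged
```

On the sample above, user 123 accumulates 312 + 393 = 705 clicks between 10:16 and 10:25, and 999 posts 888 clicks in a single record, so both would be flagged.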
Spark practice (3): clean and sort Social Security numbers

Requirements:
1. Separate valid SSNs from invalid SSNs.
2. Count the number of valid SSNs.

Sample.txt:

```
402-94-7709
283-90-3049
124-01-2425
123123
2088-57-9593
905-60-3585
44-82-8341
257581087
327-84-0220
402-94-7709
```

Thoughts
SSN-indexed data is commonly seen and stored in many file systems. The trick to accelerate the speed on Spark is to build a numerical key and use the sortByKey operator. Besides, the accumulator provides a global variable existing across machines in a cluster, which is especially useful for counting data.

Single machine solution

```python
#!/usr/bin/env python
# coding=utf-8
htable = {}
valid_cnt = 0
with open('sample.txt', 'rb') as infile, open('sample_bad.txt', 'wb') as outfile:
    for l in infile:
        l = l.strip()
        nums = l.split('-')
        key = -1
        if l.isdigit() ...
```
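The snippet above is truncated, so here is a hedged sketch of the numeric-key idea it describes: map each line to an integer key (or -1 if invalid), then sort the keys, which is what sortByKey would do across a cluster. Treating a bare 9-digit string as valid is an assumption based on the snippet's `l.isdigit()` branch, and the helper names are mine.

```python
def ssn_key(line):
    """Map an SSN string to a numeric sort key, or -1 if invalid.
    Accepts 'ddd-dd-dddd' and, following the truncated snippet's
    isdigit() branch, a bare 9-digit string (an assumption)."""
    if line.isdigit():
        return int(line) if len(line) == 9 else -1
    parts = line.split('-')
    if [len(p) for p in parts] == [3, 2, 4] and all(p.isdigit() for p in parts):
        return int(''.join(parts))
    return -1

def clean_and_sort(lines):
    """Return (sorted valid keys, valid count), mimicking sortByKey
    plus an accumulator on a single machine."""
    keys = [k for k in map(ssn_key, lines) if k != -1]
    return sorted(keys), len(keys)
```

For example, `ssn_key('402-94-7709')` yields the key 402947709, while malformed lines such as `'44-82-8341'` or `'123123'` map to -1 and would be routed to the bad-record file.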