There is always a trade-off between time complexity and space complexity for computer programs. Deceasing the time cost will increase space cost, and vice versa, The ideal solution to parallelize the program to multiple cores if there is a multiple-core computer, or even scale it out to multiple machines across a cluster, which would eventually reduce both time complexity and space complexity.

Spark is currently the hottest platform for cluster computing on top of Hadoop, and its Python interface provides

`map`

, `reduce`

and many other methods, which allow a mapRecdue job in a straightforward way, and therefore easily migrate an algorithm from a single machine to a cluster of many machines. - Minimize space complexity

There is a question to look for the only single number from a mostly paired-number array.

Single Number`Given an array of integers, every element appears twice except for one.`

Find that single one.

Note:

Your algorithm should have a linear runtime complexity.

Could you implement it without using extra memory?

The optimal space complexity for this question is

`O(1)`

by using the bit manipulator `xor`

. For a cluster, since Spark aggregates memory acrosss machines, the space complexity may become `O(1/k)`

, where k is the number of the machines in the cluster.`# Space.py`

import pyspark

from random import shuffle

from operator import xor

sc = pyspark.Spark.Context()

# Create the test case and the target is 99

testCase = range(0, 99) + range(0, 100)

shuffle(testCase)

# Run the testing with Spark

result = sc.parallelize(testCase).reduce(xor)

# Show the result

print result

sc.stop()

- Minimize time complexity

There is a question to implement the function (or a method) that returns the square root of an integer.

Sqrt(x)`Implement int sqrt(int x).`

Compute and return the square root of x.

The optimal solution could achieve the time complexity of

`O(lgN)`

by using binary search. If we pass the `sqrt`

function to Spark, then the time complexity will decreased to `O(lgN/k)`

, where k is the number of the machines in the cluster.`# Time.py`

import pyspark

sc = pyspark.Spark.Context()

# Implement binary search for square root function

def sqrt(x):

if x < 0 or not isinstance(x, int):

raise ValueError

hi, lo = x/2 + 1, 0

while hi >= lo:

mid = (hi + lo) / 2

if mid * mid > x:

hi = mid - 1

else:

lo = mid + 1

return int(hi)

# Test the square root algorithm

testCase = sc.parallelize(xrange(1, 100))

result = testCase.map(lambda x: sqrt(x))

# Show the result

for x in result.collect():

print x

sc.stop()

- Find the worst rating by accounts

There is a question to find the worst one among a few rating letters for each of the account numbers.

Want to find the worst rating for each account number.sample.txt is below`Account_number Rating`

1 A

1 B

2 A

2 B

2 C

3 A

3 Cthe desired result should be like`1 B`

2 C

3 C

The question is essentially one of the grouping questons. Spark’s pair RDD, which reflects the key-value relationship for groups, supplies a one-line solution for it.

`import pyspark`

sc = pyspark.SparkContext()

# Squeeze the letters by keys

rdd = sc.textFile('sample.txt')

result = rdd.map(lambda x: x.split()).filter(x: x[0].isdigit()).reduceByKey(max)

# Show the result

for x in result.collect():

print x

sc.stop()

In a conclusion, Spark significantly changes the way we think about data analysis.