The line-by-line feature in Python allows it to count hard disk-bound data. The most frequently used data structures in Python are list and dictionary. Many cases the dictionary has advantages since it is a basically a hash table that many realizes O(1) operations.

However, for the tasks of counting values, the two options make no much difference and we can choose any of them for convenience. I listed two examples below.

There is a question to count the strings in Excel.

Count the unique values in one column in EXCEL 2010. The worksheet has 1 million rows and 10 columns.

or numbers.

For example,

A5389579_10

A1543848_6

A5389579_8

Need to cut off the part after (including) underscore such as from A5389579_10 to A5389579

Commonly Excel on a desktop can’t handle this size of data, while Python would easily handle the job.

`# Load the Excel file by the xlrd package`

import xlrd

book = xlrd.open_workbook("test.xlsx")

sh = book.sheet_by_index(0)

print sh.name, sh.nrows, sh.ncols

print "Cell D30 is", sh.cell_value(rowx=29, colx=3)

# Count the unique values in a dictionary

c = {}

for rx in range(sh.nrows):

word = str(sh.row(rx)[1].value)[:-3]

try:

c[word] += 1

except:

c[word] = 1

print c

There is a question to count emails.

The challenge is to scale up the small sample data to larger size. The solution I have has the complexity of O(nlogn), which is only limited by the sorting step.A 3-column data set includes sender, receiver and timestamp. How to calculate the time between the sender sends the email

and the receiver sends the reply email?

`raw_data = """`

SENDER|RECEIVER|TIMESTAMP

A B 56

A A 7

A C 5

C D 9

B B 12

B A 8

F G 12

B A 18

G F 2

A B 20

"""

# Transform the raw data to a nested list

data = raw_data.split()

data.pop(0) # Remove the Head

data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))

# Sort the nested list by the timestamp

from operator import itemgetter

data.sort(key=itemgetter(2))

for r in data:

print r

# Count the time difference in a list

c = []

while len(data) != 1:

y = data.pop(0)

for x in data:

if x[0] == y[1] and x[1] == y[0]:

diff = x[2] - y[2]

print y, x, '---->', diff

c.append(diff)

break # Only find the quickest time to respond

print c

I come up with the O(n) solution below, which utilizes two hash tables to decrease the complexity.

`__author__ = 'dapangmao'`

def find_duration(data):

# Construct two hash tables

h1 = {}

h2 = {}

# Find the starting time for each ID pair

for x in data:

if x[0] != x[1]:

key = x[0] + x[1]

try:

h1[key] = x[2]

except:

h1[key] = min(h1[key], x[2])

# Find the minimum duration for each ID pair

for x in data:

key = x[1] + x[0]

if h1.has_key(key):

duration = x[2] - h1[key]

try:

h2[key] = duration

except:

h2[key] = min(h2[key], duration)

return h2

if __name__ == "__main__":

raw_data = """

SENDER|RECEIVER|TIMESTAMP

A B 56

A A 7

A C 5

C D 9

B B 12

B A 8

F G 12

B A 18

G F 2

A B 20

"""

# Transform the raw data to a nested list

data = raw_data.split()

data.pop(0) # Remove the Head

data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))

# Verify the result

print find_duration(data)