Speeding up Python processing of a huge data file

I have a large dataset stored as a 17 GB csv file (fileData), containing a variable number of records (up to around 30,000) for each customer. I am trying to search for specific customers (listed in fileSelection, 1,500 out of 90,000 customers in total) and copy the records for each of those customers into a separate csv file (fileOutput).

I am very new to Python, but I am using it because VBA and MATLAB (which I know better) can't handle a file this size. (I write the code in Aptana Studio, but run Python directly from the cmd line for speed. I'm running 64-bit Windows 7.)

The code I have written extracts some of the customers, but it has two problems:

1) It fails to find most of the customers in the large dataset. (I believe they are all in there, but can't be completely sure.)

2) It is slow. It would be even better if it could make better use of the machine's cores.

Here is the code:

```python
def main():

    # Initialisation:
    # - identify columns in selection file
    fS = open(fileSelection, "r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey = selheaderlist.index('CUSTOMER_KEY')

    # Identify columns in dataset file
    fileData = path2 + file_data
    fD = open(fileData, "r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID = dataheaderlist.index('CUSTOMER_ID')
    fD.close()

    # Log for customers that are not found (assumed name; the original post
    # uses fL without showing where it is opened)
    fL = open(path3 + file_out_root + "notfound.csv", "w")

    # For each customer in the selection file
    customercount = 1
    for sr in fS:

        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]

        # Look for required customer in dataset: the whole 17 GB file is
        # re-opened and re-scanned from the top for every customer
        found = 0
        fD = open(fileData, "r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr:
                    break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1

                    # Open output file and write the dataset header
                    fileOutput = path3 + file_out_root + str(requiredcustomer) + ".csv"
                    fO = open(fileOutput, "w+")
                    fO.write(str(header))

                    # Copy all consecutive records for the required customer
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        if not dr:
                            break  # guard against end of file
                        datrecord = dr.split(",")

                    # Close output file
                    fO.close()

        if found == 1:
            print("Customer Count " + str(customercount) + " Customer ID " + str(requiredcustomer) + " copied.")
            customercount = customercount + 1
        else:
            print("Customer ID " + str(requiredcustomer) + " not found in dataset")
            fL.write(str(requiredcustomer) + "," + "NOT FOUND\n")
        fD.close()

    fS.close()
    fL.close()
```

It ran for days and found only a few hundred customers, and it never found any more.
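Two things in the code above explain the symptoms. Problem 2 is structural: the 17 GB dataset is re-opened and re-scanned from the top for every one of the 1,500 customers, so the total work grows as (number of customers) × (file size); a single pass with a lookup set removes that factor entirely. For problem 1, one plausible cause (an assumption, since the post does not show the column layout) is that `readline()` keeps the trailing newline: if `CUSTOMER_KEY` or `CUSTOMER_ID` is the last column in its file, the split field carries a `\n` and the equality test can never succeed. A tiny illustration with made-up data:

```python
# Made-up record; the real column layout is unknown.
line = "2014-01-01,100.00,CUST0042\n"   # customer ID in the last column
fields = line.split(",")

print(fields[2] == "CUST0042")          # False: the field is actually "CUST0042\n"
print(fields[2].strip() == "CUST0042")  # True once the newline is stripped
```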

Thanks @Paul Cornelius, that is far more efficient. I adopted your approach, along with the csv handling suggested by @Bernardo.

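A minimal sketch of what that combined approach might look like (not the author's actual code; assuming Python 3 and reusing the question's names fileSelection, fileData, path3 and file_out_root, with everything else an assumption): read the selection keys into a set, then stream the big file exactly once, routing each matching row to its customer's output file.

```python
import csv

def main():
    # Pass 1: read the selection file once and keep the wanted customer
    # keys in a set, so membership tests on the big file are O(1).
    with open(fileSelection, newline="") as fS:
        sel = csv.reader(fS)
        custkey = next(sel).index("CUSTOMER_KEY")
        wanted = {row[custkey] for row in sel}

    # Pass 2: stream through the 17 GB dataset exactly once, routing each
    # matching record to its customer's own output file.
    writers = {}   # customer ID -> csv.writer for that customer's file
    handles = []   # underlying file objects, closed at the end
    try:
        with open(fileData, newline="") as fD:
            data = csv.reader(fD)
            dataheader = next(data)
            custID = dataheader.index("CUSTOMER_ID")
            for row in data:
                cust = row[custID]
                if cust not in wanted:
                    continue
                w = writers.get(cust)
                if w is None:
                    # First record for this customer: create its file
                    fO = open(path3 + file_out_root + cust + ".csv", "w", newline="")
                    handles.append(fO)
                    w = csv.writer(fO)
                    w.writerow(dataheader)
                    writers[cust] = w
                w.writerow(row)
    finally:
        for fO in handles:
            fO.close()
```

Note that this keeps up to 1,500 output files open at once, which can bump into OS handle limits on some systems; if it does, opening each file in append mode per record, or sorting the data by customer first, avoids the issue. Either way, the big file is now read once instead of 1,500 times.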
