Crawler summary

A small crawler, summarized
Python data types

Strings: single or double quotes delimit an ordinary string; triple quotes delimit a long string that spans several lines.

# named s rather than str so the built-in str() stays usable
s = 'this is string'
s = "this is also a string"
s = '''
        this is a long string
        which includes many substrings
        and multiple lines
        '''

Lists are written with square brackets [] and can hold items of any data type.

    otherType = object()  # stand-in for a value of any other type
    numbers = [1, 2, 3, 4, 5]
    multipleTypeList = ['123', 123, otherType]

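Tuples are written with parentheses (). The elements of a tuple cannot be modified once it is defined, but the whole tuple can be deleted with del, two tuples can be joined with +, and a tuple can be repeated with *.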

otherType = object()  # stand-in for a value of any other type
numbers = (1, 2, 3, 4, 5)
multipleTypeTuple = (1, 2, '123', otherType)
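The tuple operations described above can be sketched as follows. This is a minimal illustration in Python 3 syntax; the names `point` and `extra` are made up for the example.

```python
point = (1, 2, 3)
extra = ('a', 'b')

joined = point + extra      # + builds a new tuple holding the elements of both
repeated = extra * 2        # * repeats the elements

try:
    point[0] = 10           # tuples are immutable, so item assignment fails
except TypeError as err:
    error_text = str(err)

del extra                   # del removes the name bound to the whole tuple
```

Note that + and * never change an existing tuple; they always return a new one.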

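Dictionaries are unordered collections of objects, equivalent to a map in other languages (an associative array or hash table). A dictionary consists of keys and their corresponding values; values are looked up by key, and keys must be unique.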

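# named phones rather than dict so the built-in dict() stays usable
phones = {'Alice': '2341', 'Beth': '9102', 'Cecil': '3258'}
multipleDic = {'1' : 1, '2' : '123'}
# return the value for a key, or the default if the key is absent
# (the default must be passed positionally; get() takes no keyword arguments)
phones.get('Alice', None)
# return True if the key is in the dict, otherwise False (Python 2 only)
phones.has_key('Alice')
# return a list of (key, value) tuples
phones.items()
# return a list of all keys
phones.keys()
# add the key/value pairs of another dict to this dict
phones.update(multipleDic)
# return a list of all values
phones.values()
# remove all elements
phones.clear()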
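For readers on Python 3, the same dictionary operations look slightly different. The sketch below assumes Python 3, where has_key() no longer exists and keys(), values(), and items() return views rather than lists; the name `phone_book` is made up for the example.

```python
phone_book = {'Alice': '2341', 'Beth': '9102'}

# Python 3 removed has_key(); the `in` operator tests for a key instead.
has_alice = 'Alice' in phone_book

# get() returns the default when the key is missing instead of raising KeyError.
missing = phone_book.get('Zed', 'unknown')

# update() merges another dict in, overwriting values for keys that already exist.
phone_book.update({'Beth': '0000', 'Cecil': '3258'})

# keys() and items() return views in Python 3; sorted() turns them into lists.
names = sorted(phone_book.keys())
pairs = sorted(phone_book.items())
```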

Sets

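data = 'item'  # any hashable value
seen = set()   # named seen rather than set so the built-in set() stays usable
seen.add(data)
# remove and return an arbitrary element
seen.pop()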
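One plausible use of a set in a crawler is remembering which URLs have already been seen. The sketch below is an assumption about usage rather than something taken from these notes, and the example.com URLs are placeholders.

```python
visited = set()
for url in ['http://example.com/a', 'http://example.com/b', 'http://example.com/a']:
    visited.add(url)     # adding a duplicate has no effect

count = len(visited)     # only distinct URLs are stored
taken = visited.pop()    # removes and returns an arbitrary element
```

Because pop() makes no promise about which element it returns, a set does not work as an ordered work list; a queue fits that job better.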

Queue

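import Queue  # Python 2 module name; Python 3 renamed it to queue
myqueue = Queue.Queue(maxsize = 10)
myqueue.put(10)
# remove and return an item from the queue
myqueue.get()

# The Queue module provides three queue classes:
# Queue.Queue(maxsize): FIFO, first in, first out
# Queue.LifoQueue(maxsize): LIFO, like a stack, last in, first out
# Queue.PriorityQueue(maxsize): priority queue, the lowest-valued entry comes out first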
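The difference between the three queue classes is easiest to see by putting the same items into each and draining them. This is a small sketch for Python 3, where the module is named `queue`; the helper `drain` is made up for the example.

```python
import queue  # named Queue in Python 2

fifo = queue.Queue(maxsize=10)
lifo = queue.LifoQueue(maxsize=10)
prio = queue.PriorityQueue(maxsize=10)

for item in (3, 1, 2):
    fifo.put(item)
    lifo.put(item)
    prio.put(item)

def drain(q):
    """Remove and return every item currently in the queue."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

fifo_order = drain(fifo)   # insertion order
lifo_order = drain(lifo)   # reverse insertion order
prio_order = drain(prio)   # smallest value first
```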

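To remove all special characters, use a regular expression: re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+", 'replaceString', 'contentString'), where the second argument is the replacement and the third is the text to clean.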

#-*-coding:utf-8-*-
import re
temp = "  /  _ /  _/   、 , Q：  1 5.  8 0. ！！？？  8 6 。0.  2。 3      , , , "
temp = temp.decode("utf8")
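# Python 2: decode both the pattern and the input so the full-width characters match as Unicode
string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+".decode("utf8"), "".decode("utf8"),temp)
print string # whitespace and the listed punctuation are removed; the digits print as 158086023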
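In Python 3 strings are already Unicode, so the decode calls are not needed. The sketch below uses a trimmed version of the same character classes; the function name `strip_special` and the shortened sample input are made up for the example.

```python
import re

# Two character classes: ASCII punctuation plus whitespace, then full-width punctuation.
PUNCTUATION = re.compile(r"[\s+.!/_,$%^*(\"']+|[+—！，。？、~@#￥%…&*（）]+")

def strip_special(text):
    """Return text with whitespace and the listed punctuation removed."""
    return PUNCTUATION.sub("", text)

cleaned = strip_special("Q 1 5.  8 0. ！！？？  8 6 。0.  2。 3 , , ,")
```

Compiling the pattern once is worthwhile when the same cleanup runs on every scraped page.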

Checking whether a value is None, and comparing strings for equality with ==:

value = None  # None is the only instance of NoneType; named value so the built-in type() stays usable
if value is None :
    pass

string = ''
if string == '':
    pass

Exception handling

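try:
    pass
except Exception as e:  # "except Exception, e" also works, but only in Python 2
    e.args      # tuple of the arguments passed to the exception
    e.message   # message text (Python 2 only, deprecated since 2.6)
    str(e)      # readable message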
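As a concrete case of the pattern above, the sketch below catches a specific exception and inspects it. It assumes Python 3, where e.message no longer exists; `parse_port` is a hypothetical helper written only for this example.

```python
def parse_port(text):
    """Hypothetical helper: convert text to int, reporting failure instead of raising."""
    try:
        return int(text), None
    except ValueError as e:
        # e.args holds the constructor arguments; str(e) is the readable message.
        return None, (e.args, str(e))

ok, _ = parse_port("8080")
_, failure = parse_port("abc")
```

Catching ValueError rather than Exception keeps unrelated errors visible instead of silently swallowing them.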

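Encoding issues

Converting between ASCII and UTF-8: most text uses UTF-8 characters, but Python (Python 2 here) uses ASCII as its default encoding, which leads to garbled output.

For files that contain Chinese characters, use codecs to specify the encoding when writing.

When writing a CSV file, the file must be marked as UTF-8.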

# Unicode string
string = u'   '
# encode converts a Unicode string to a byte string
encoded = string.encode('utf-8')
# decode converts a byte string back to a Unicode string
encoded.decode('utf-8')

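# When a Chinese string sits inside a container such as a list, set, or dict, printing the container calls __repr__() on each element,
# so print shows the escaped Unicode code points rather than the characters themselves.
# This is not corruption; the data is still valid Unicode.
# To see the characters, print the elements one by one: iterate over a list or set, or over the keys and values of a dict.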
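The encode/decode round trip and the escaped container output can be reproduced in a few lines. This sketch assumes Python 3, where print shows the characters inside a list directly, so `ascii()` is used here to imitate the escaped form that Python 2 prints; the sample text is an arbitrary two-character string written as escapes.

```python
text = u'\u4e2d\u6587'              # two Chinese characters, written as escapes
data = text.encode('utf-8')         # str -> bytes
round_trip = data.decode('utf-8')   # bytes -> str

items = [text]
escaped = ascii(items)              # ASCII-only repr of the list, with \u escapes
readable = ', '.join(items)         # joining the elements yields the characters themselves
```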


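# Write a file as utf-8
import codecs
writeFile = codecs.open(fileName, 'w', "utf-8")
writeFile.write(content)
writeFile.close()

# Write a csv file (the Python 2 csv module expects utf-8 encoded byte strings)
import csv
f = open(fileName, 'wb')
# mark the file as utf8 by writing the BOM first
f.write(codecs.BOM_UTF8)
# a file object has no writerow(); wrap it in csv.writer
writer = csv.writer(f)
# one row
writer.writerow(content)
# several rows
writer.writerows(content)

f.close()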
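In Python 3 the same result can be had without writing the BOM by hand. The sketch below assumes Python 3 and uses the standard `utf-8-sig` codec, which emits the BOM automatically; the rows and the temporary output path are made up for the example.

```python
import csv
import os
import tempfile

rows = [['name', 'city'], [u'\u5f20\u4e09', u'\u5317\u4eac']]  # made-up sample rows
path = os.path.join(tempfile.mkdtemp(), 'out.csv')            # hypothetical output path

# utf-8-sig writes the BOM automatically; newline='' lets csv control line endings.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(rows[0])      # one row
    writer.writerows(rows[1:])    # several rows

with open(path, 'rb') as f:
    raw = f.read()                # raw bytes, to confirm the BOM is present
```

The BOM mainly matters for spreadsheet programs such as Excel, which may otherwise guess the wrong encoding when opening the file.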
