Python Scientific Computing with Pandas (6): Text Data Processing


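Pandas provides a set of string methods, exposed through the .str accessor, that make it easy to operate on each element of an array.
Common string methods
Counting characters; missing/NA values are skipped automatically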
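# Call string methods through .str; missing/NA values are skipped automatically
import numpy as np
import pandas as pd
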
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')

print(s.str.count('b'))  
0          A
1          b
2          C
3    bbhello
4        123
5        NaN
6         hj
dtype: object
  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
-----
0    0.0
1    1.0
2    0.0
3    2.0
4    0.0
5    NaN
6    0.0
dtype: float64
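
Because the missing entry stays missing, the counts above come back as floats. The following is a minimal sketch of one way to get integer counts, assuming a missing entry should be treated as zero matches (that choice is an assumption of this example, not something the original states). It also shows that the argument to count is treated as a regular expression.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])

# The missing value stays missing in the raw result
counts = s.str.count('b')

# Assumption for this sketch: a missing entry counts as zero matches
counts_int = counts.fillna(0).astype(int)
print(counts_int.tolist())  # [0, 1, 0, 2, 0, 0, 0]

# The pattern is a regular expression, so a character class also works
digit_counts = s.str.count(r'\d').fillna(0).astype(int)
print(digit_counts.tolist())  # [0, 0, 0, 0, 3, 0, 0]
```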

Converting strings to upper and lower case
s = pd.Series(['A','asd','123',np.nan])
print(s)
print('-----')
print(s.str.lower(), '→ lower\n')  # convert to lower case
print(s.str.upper(), '→ upper\n')  # convert to upper case
# NaN values stay NaN
0      A
1    asd
2    123
3    NaN
dtype: object
-----
0      a
1    asd
2    123
3    NaN
dtype: object → lower  

0      A
1    ASD
2    123
3    NaN
dtype: object → upper  
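
A typical use of case conversion is case-insensitive comparison. This is a small sketch under that assumption; the sample values and the target string 'asd' are chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'asd', 'ASD', np.nan])

# Normalize to lower case first, then compare against a lower-case target
mask = s.str.lower() == 'asd'

# Keep only the rows that match regardless of case
matches = s[mask]
print(matches.tolist())  # ['asd', 'ASD']
```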

String length and start/end checks
s = pd.Series(['A','b','bbhello','123',np.nan])
print(s)
print('-----')
print(s.str.len(), '→ len\n')  # length of each string
print(s.str.startswith('b'), '→ starts with b\n')  # does the string start with 'b'?
print(s.str.endswith('3'), '→ ends with 3\n')  # does the string end with '3'?
0          A
1          b
2    bbhello
3        123
4        NaN
dtype: object
-----
0    1.0
1    1.0
2    7.0
3    3.0
4    NaN
dtype: float64 → len

0    False
1     True
2     True
3    False
4      NaN
dtype: object → starts with b

0    False
1    False
2    False
3     True
4      NaN
dtype: object → ends with 3
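
In the output above, the result of startswith contains NaN for the missing entry, so it is not a pure boolean mask. The sketch below uses the na parameter of startswith to fill that gap so the result can be used for row selection. Treating a missing value as "does not match" is an assumption of this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'bbhello', '123', np.nan])

# na=False: report missing entries as False instead of NaN
mask = s.str.startswith('b', na=False)
print(mask.tolist())  # [False, True, True, False, False]

# The mask is now purely boolean and can select rows directly
selected = s[mask]
print(selected.tolist())  # ['b', 'bbhello']
```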

Stripping whitespace
  • .strip() removes whitespace from both ends
  • .lstrip() removes whitespace on the left
  • .rstrip() removes whitespace on the right
    # Stripping whitespace: .strip(), .lstrip(), .rstrip()
    
    
    s = pd.Series(['  jack', 'jill  ', '  jesse  ', 'frank'])
    df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                      index=range(3))
    print(s)
    print(df)
    print('-----')
    
    print(s.str.strip().values)  # strip whitespace on both sides
    print(s.str.lstrip().values)  # strip whitespace on the left
    print(s.str.rstrip().values)  # strip whitespace on the right
    print('-----')
    
    
    df.columns = df.columns.str.strip()
    print(df)
    # The column names contained spaces; strip() removes them
    0         jack
    1       jill  
    2      jesse  
    3        frank
    dtype: object
        Column A    Column B 
    0    0.807158   -0.759207
    1   -0.380771   -0.816461
    2    0.160034    1.014544
    -----
    ['jack' 'jill' 'jesse' 'frank']
    ['jack' 'jill  ' 'jesse  ' 'frank']
    ['  jack' 'jill' '  jesse' 'frank']
    -----
       Column A  Column B
    0  0.807158 -0.759207
    1 -0.380771 -0.816461
    2  0.160034  1.014544
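
The three strip methods also accept a string of characters to remove instead of whitespace. The following sketch illustrates that with made-up sample data; the '-' and '*' padding characters are assumptions chosen for the example.

```python
import pandas as pd

s = pd.Series(['--jack', 'jill--', '-*jesse*-', 'frank'])

# Remove any mix of '-' and '*' from both ends
both = s.str.strip('-*')
print(both.tolist())  # ['jack', 'jill', 'jesse', 'frank']

# Remove only leading '-' characters; everything else is left alone
left = s.str.lstrip('-')
print(left.tolist())  # ['jack', 'jill--', '*jesse*-', 'frank']
```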
    

String replacement
    # Common string methods (3) - replace
    
    df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                      index=range(3))
    df.columns = df.columns.str.replace(' ','-')
    print(df)
    # Replace every space with a hyphen
    
    df.columns = df.columns.str.replace('-','¥',n=2)
    print(df)
    # n: maximum number of replacements per string
       -Column-A-  -Column-B-
    0    1.097967    0.017149
    1   -0.079448    0.603613
    2   -0.205197    0.654724
       ¥Column¥A-  ¥Column¥B-
    0    1.097967    0.017149
    1   -0.079448    0.603613
    2   -0.205197    0.654724
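
replace can also take a regular expression. The sketch below passes regex explicitly, because the default for that parameter changed in pandas 2.0, so spelling it out keeps the behavior the same across versions. The column names mirror the example above; the other sample values are made up for illustration.

```python
import pandas as pd

cols = pd.Index([' Column A ', ' Column B '])

# Strip the outer spaces, then turn each inner run of whitespace into '_'
clean = cols.str.strip().str.replace(r'\s+', '_', regex=True)
print(list(clean))  # ['Column_A', 'Column_B']

# With regex=False the pattern is matched literally, so '.' means a dot
s = pd.Series(['a.b.c', '1.2.3'])
literal = s.str.replace('.', '-', regex=False)
print(literal.tolist())  # ['a-b-c', '1-2-3']

# n limits how many replacements are made in each string
limited = s.str.replace('.', '-', n=1, regex=False)
print(limited.tolist())  # ['a-b.c', '1-2.3']
```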
    

String splitting
    # Common string methods (4) - split, rsplit
    
    s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
    print(s)
    print('-----')
    
    print(s.str.split(','))  # each string is split into a list
    print('----')
    
    print(s.str.split(',')[0])
    print(s.str.split(',').str.get(1))
    print('-----')
    # Use .str.get() or [] to access elements of the split lists
    
    print(s.str.split(',', expand=True))
    print(s.str.split(',', expand=True, n = 1))
    print(s.str.rsplit(',', expand=True, n = 1))
    print('-----')
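    # expand=True spreads the split parts across columns and returns a DataFrame
    # n limits the number of splits
    # rsplit works like split but in reverse, from the end of the string toward the start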
    
    df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],
                      'key2':['a-b-c','1-2-3',[':-.- ']]})
    print(df['key2'].str.split('-'))
    # Using split on a DataFrame column
    0      a,b,c
    1      1,2,3
    2    [a,,,c]
    3        NaN
    dtype: object
    -----
    0    [a, b, c]
    1    [1, 2, 3]
    2          NaN
    3          NaN
    dtype: object
    ----
    ['a', 'b', 'c']
    0      b
    1      2
    2    NaN
    3    NaN
    dtype: object
    -----
         0    1    2
    0    a    b    c
    1    1    2    3
    2  NaN  NaN  NaN
    3  NaN  NaN  NaN
         0    1
    0    a  b,c
    1    1  2,3
    2  NaN  NaN
    3  NaN  NaN
         0    1
    0  a,b    c
    1  1,2    3
    2  NaN  NaN
    3  NaN  NaN
    -----
    0    [a, b, c]
    1    [1, 2, 3]
    2          NaN
    Name: key2, dtype: object
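
As a follow-up to the split examples, here is a minimal sketch of turning the split parts into named columns. The column names 'first', 'second' and 'third' are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a,b,c', '1,2,3', np.nan])

# expand=True returns a DataFrame with one column per split part
parts = s.str.split(',', expand=True)
parts.columns = ['first', 'second', 'third']  # hypothetical names
print(parts)

# n=1 splits once from the left: 'a' | 'b,c'
head_tail = s.str.split(',', n=1, expand=True)

# rsplit with n=1 splits once from the right: 'a,b' | 'c'
init_last = s.str.rsplit(',', n=1, expand=True)
print(head_tail)
print(init_last)
```

The row that was missing in the input stays missing in every output column.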
    

String indexing
    # String indexing
    
    s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
    df = pd.DataFrame({'key1':list('abcdef'),
                      'key2':['hee','fv','w','hija','123',np.nan]})
    
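    print(s.str[0])  # take the first character
    print(s.str[:2])  # take the first two characters
    print(df['key2'].str[0]) 
    # After .str, indexing works the same way as on a plain string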
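
The indexing example above is shown without its output, so the following sketch reruns the same Series and lists what each expression yields. Missing values are dropped before printing only to keep the printed lists short; that step is an addition of this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])

# Index into each string exactly as with a plain Python string
first = s.str[0]
first_two = s.str[:2]
last = s.str[-1]

print(first.dropna().tolist())      # ['A', 'b', 'C', 'b', '1', 'h']
print(first_two.dropna().tolist())  # ['A', 'b', 'C', 'bb', '12', 'hj']
print(last.dropna().tolist())       # ['A', 'b', 'C', 'o', '3', 'j']
```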