Checking every file under a directory for a specific string in Python and printing the matching lines

character encoding, Python, encoding, grep, string processing

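Overview

For every file under [DIR_NAME],
the script checks whether it is a text file in one of the encodings defined in [TARGET_ENCODINGS].
If it is a text file, the script searches it for [SEARCH_WORD]
and writes the matching lines to the file named by [OUTPUT_NAME].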
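The encoding check described above can be sketched in isolation as trial decoding: each candidate is tried in turn, and the first one that decodes without error wins. The sample strings and the two list names (`UTF8_FIRST`, `ISO2022_FIRST`) are assumptions made up for this illustration. The sketch also shows that the order of the candidates matters: ISO-2022-JP uses only 7-bit bytes, so any ISO-2022-JP data also decodes as UTF-8.

```python
# -*- coding: utf-8 -*-
# Sketch: guess a text encoding by trying each candidate in order.

def guess_charset(data, encodings):
    """Return the first encoding that decodes data cleanly, else 'binary'."""
    for enc in encodings:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return 'binary'

# Two illustrative candidate orders (names are hypothetical).
UTF8_FIRST = ['utf-8', 'shift-jis', 'euc-jp', 'iso2022-jp']
ISO2022_FIRST = ['iso2022-jp', 'utf-8', 'shift-jis', 'euc-jp']

text = u'\u65e5\u672c\u8a9e <font>'  # Japanese sample text followed by a tag

print(guess_charset(text.encode('utf-8'), UTF8_FIRST))        # utf-8
print(guess_charset(text.encode('shift-jis'), UTF8_FIRST))    # shift-jis
print(guess_charset(b'\xff\xff\xff', UTF8_FIRST))             # binary

# ISO-2022-JP bytes are all 7-bit, so the UTF-8 attempt succeeds first.
print(guess_charset(text.encode('iso2022-jp'), UTF8_FIRST))     # utf-8
print(guess_charset(text.encode('iso2022-jp'), ISO2022_FIRST))  # iso2022-jp
```

Trial decoding is only a heuristic: a successful decode shows that the bytes are valid in that encoding, not that the file was actually written in it.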

Environment

Windows 8 + Python 2.6 series

Code

find_directory.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
# vim: fileencoding=utf-8

import codecs
import os
import sys

DIR_NAME = 'C:\\html\\HOGE\\'
OUTPUT_NAME = 'result_find_file_list.csv'

SEARCH_WORD = '<font'

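# Candidates are tried in order. ISO-2022-JP comes first because it uses only
# 7-bit bytes: if UTF-8 were tried first, ISO-2022-JP files would pass as UTF-8.
TARGET_ENCODINGS = [
    'iso2022-jp',
    'utf-8',
    'shift-jis',
    'euc-jp'
]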

FLAG_STDOUT = True
#FLAG_STDOUT = False

# Shorthand for writing to the console.
write = sys.stdout.write

def guess_charset(data):
    """Return the first encoding in TARGET_ENCODINGS that decodes data, or 'binary'."""
    for enc in TARGET_ENCODINGS:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return 'binary'

# The result file is written as Shift-JIS with CRLF line endings, matching the data rows.
out = codecs.open(OUTPUT_NAME, 'w', 'shift-jis')
out.write('path,line_number,search,target_line\r\n')

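for dirpath, dirs, files in os.walk(DIR_NAME):
    for fn in files:
        path = os.path.join(dirpath, fn)
        # Read the raw bytes first so the encoding can be guessed.
        try:
            fobj = open(path, 'rb')
            try:
                data = fobj.read()
            finally:
                fobj.close()
        except IOError:
            continue  # unreadable file: skip it
        enc = guess_charset(data)
        if enc == 'binary':
            continue
        count = 0
        try:
            reader = codecs.open(path, 'r', enc)
            try:
                for l in reader:
                    count = count + 1
                    if SEARCH_WORD not in l:
                        continue
                    # One CSV row per hit; double quotes in the line become single quotes.
                    output = '"' + path + '","' + str(count) + '","' + SEARCH_WORD + '","' + l.replace('"', "'").replace('\r', '').replace('\n', '') + '"\r\n'
                    if FLAG_STDOUT:
                        write(output)
                    out.write(output)
            finally:
                reader.close()
        except (IOError, UnicodeError):
            continue  # read, decode, or output-encoding failure: skip the rest of this file

out.close()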

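Notes

As usual, the exception handling is rough.
There is plenty of room for refactoring,
but I want to put this to real use tomorrow, so I am posting it as is for now.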
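As one possible refactoring direction, the row formatting could be delegated to the standard `csv` module instead of concatenating strings by hand. This is not what the script above does: it is a swapped-in technique, sketched under the assumption of Python 3 (on Python 2 the `csv` module expects byte strings). The helper name `format_row` and the sample path are hypothetical.

```python
import csv
import io

def format_row(path, line_number, search_word, line):
    """Return one CSV row as text, with every field quoted."""
    buf = io.StringIO()
    # QUOTE_ALL quotes each field; embedded double quotes are doubled
    # by the csv module rather than replaced with single quotes.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\r\n')
    writer.writerow([path, line_number, search_word, line.rstrip('\r\n')])
    return buf.getvalue()

# Hypothetical sample values, for illustration only.
row = format_row('C:\\html\\HOGE\\index.html', 3, '<font', '<font color="red">\n')
print(repr(row))
```

With this approach the matched line keeps its original double quotes, escaped as `""` in the CSV output, so the result file can be read back without losing information.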

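Author And Source

For more information on this topic (Checking every file under a directory for a specific string in Python and printing the matching lines), see https://qiita.com/selious/items/36420298c561629925b1

Author attribution: information about the original author is available at the original URL. Copyright belongs to the original author.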
