Crawler Data Classification and JSON Data Extraction


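Data extraction concepts and data classification
Learning objectives
Understand how crawler data is classified
1 Classification of data in crawlers
The data a crawler scrapes comes in many different types. We need to understand these types so that we can extract and parse the data systematically.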

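Structured data: json, xml, etc.

Processing: convert directly to Python types

Unstructured data: HTML

Processing: regular expressions, xpath

In crawlers, structured data means json and xml.

In crawlers, unstructured data means HTML and plain strings.

Structured data can be processed with jsonpath, xpath, conversion to Python types, or bs4.

Unstructured data can be processed with regular expressions, xpath, or bs4.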
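As a rough illustration of the two categories, the sketch below converts a JSON string directly into Python types and pulls a value out of an HTML fragment with a regular expression. The sample strings and variable names are made up for this example.

```python
import json
import re

# Structured data: a JSON string converts directly into Python types.
json_text = '{"title": "Moby Dick", "price": 8.99}'
book = json.loads(json_text)
print(book["title"], book["price"])

# Unstructured data: HTML has no direct Python equivalent, so the value
# is located with a pattern (a regular expression here).
html_text = '<div class="book"><h1>Moby Dick</h1><span>8.99</span></div>'
match = re.search(r"<h1>(.*?)</h1>", html_text)
html_title = match.group(1) if match else None
print(html_title)
```

The JSON branch needs no knowledge of the layout beyond the key names, while the HTML branch depends on the exact markup around the value.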
JSON data extraction
Learning objectives

Master the json module's methods (load, loads, dump, dumps)

Understand how to use jsonpath (to extract data from JSON)
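2 Review: what is JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits scenarios that involve data exchange, such as the exchange of data between a website's front end and back end.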
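To make the format concrete, the sketch below parses a small JSON text and prints the Python type that each JSON value becomes. The sample document is invented for illustration.

```python
import json

json_text = '''
{
  "name": "crawler",
  "pages": 3,
  "ratio": 0.5,
  "active": true,
  "tags": ["json", "xml"],
  "parent": null
}
'''

data = json.loads(json_text)

# JSON object -> dict, array -> list, string -> str,
# number -> int or float, true/false -> bool, null -> None
for key, value in data.items():
    print(key, type(value).__name__)
```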
3 Methods in the json module
Understanding file-like objects:
An object that has a read() or write() method is a file-like object. For example, after f = open("a.txt"), f is a file-like object.
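A file-like object does not have to be a file on disk. The sketch below, which assumes only the standard library, uses io.StringIO as an in-memory stand-in and passes it to json.dump and json.load; the sample dictionary is made up.

```python
import io
import json

# io.StringIO lives in memory but offers read() and write(),
# so json.dump and json.load accept it just like an opened file.
buffer = io.StringIO()
json.dump({"title": "Moby Dick", "price": 8.99}, buffer)

buffer.seek(0)  # rewind before reading back
restored = json.load(buffer)
print(restored)
```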
Usage:

import json

mydict = {
    "store": {
        "book": [
            {"category": "reference",
             "author": "Nigel Rees",
             "title": "Sayings of the Century",
             "price": 8.95
             },
            {"category": "fiction",
             "author": "Evelyn Waugh",
             "title": "Sword of Honour",
             "price": 12.99
             },
        ],
    }
}

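# json.dumps converts a Python object into a JSON string
# indent pretty-prints the output with line breaks and indentation
# ensure_ascii=False keeps non-ASCII characters readable instead of escaping them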
json_str = json.dumps(mydict, indent=2, ensure_ascii=False)
print('json.dumps python_type-->json_str: {}'.format(type(json_str)))

# json.loads converts a JSON string into a Python object
my_dict = json.loads(json_str)
print('json.loads json_str-->python_type: {}'.format(type(my_dict)))

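# json.dump writes a Python object to a file as JSON
with open("json.txt", "w", encoding="utf-8") as f:
    json.dump(mydict, f, ensure_ascii=False, indent=2)
input('json.dump done; press Enter to continue ')

# json.load reads JSON from a file and converts it into a Python object
with open("json.txt", "r", encoding="utf-8") as f:
    my_dict = json.load(f)
    print('json.load file --> {}: {}'.format(type(my_dict), my_dict))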
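The effect of ensure_ascii is easier to see with data that actually contains non-ASCII text. The following sketch compares the default output with ensure_ascii=False; the record is an invented sample.

```python
import json

record = {"city": "北京", "count": 2}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters

print(escaped)
print(readable)

# Both strings decode to the same Python object.
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```

Either form is valid JSON. The unescaped form is mainly useful when people read the output, and it is one reason to open output files with an explicit UTF-8 encoding.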

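4 Learning the jsonpath module
4.1 Introduction to jsonpath
jsonpath is used to parse multi-level nested JSON data. JsonPath is an information-extraction library, a tool for pulling specified information out of a JSON document, with implementations in many languages including JavaScript, Python, PHP and Java.
4.2 JsonPath is to JSON what XPath is to XML.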

Installation: pip install jsonpath

Official documentation: http://goessner.net/articles/JsonPath

4.3 JsonPath syntax:
4.4 Syntax usage examples

book_dict = { 
  "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

from jsonpath import jsonpath

print(jsonpath(book_dict, '$..author'))  # returns a list of matches, or False if nothing matches
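To show what a recursive-descent expression such as $..author does conceptually, here is a minimal pure-Python sketch. It is not the jsonpath library's implementation: find_all is a hypothetical helper written for this example, and unlike jsonpath it returns an empty list instead of False when nothing matches.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


sample = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "price": 8.95},
            {"author": "Evelyn Waugh", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_all(sample, "author")
prices = find_all(sample, "price")
missing = find_all(sample, "isbn")
print(authors, prices, missing)
```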

JSONPath                                   Result
$.store.book[*].author                     the authors of all books in the store
$..author                                  all authors
$.store.*                                  all elements under store
$.store..price                             the price of everything in the store
$..book[2]                                 the third book
$..book[(@.length-1)] | $..book[-1:]       the last book
$..book[0,1] | $..book[:2]                 the first two books
$..book[?(@.isbn)]                         all books that have an isbn
$..book[?(@.price<10)]                     all books priced below 10
$..*                                       all data
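Several of the expressions above correspond roughly to ordinary Python indexing, slicing and list comprehensions. The sketch below applies those plain-Python counterparts to a shortened copy of the book list. It is only an analogy: the jsonpath function always wraps its matches in a list, whereas books[2] here yields the dictionary itself.

```python
books = [
    {"title": "Sayings of the Century", "price": 8.95},
    {"title": "Sword of Honour", "price": 12.99},
    {"title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
    {"title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99},
]

third = books[2]                                  # $..book[2]
last = books[-1:]                                 # $..book[-1:]
first_two = books[:2]                             # $..book[:2]
with_isbn = [b for b in books if "isbn" in b]     # $..book[?(@.isbn)]
cheap = [b for b in books if b["price"] < 10]     # $..book[?(@.price<10)]

print(third["title"])
print([b["title"] for b in cheap])
```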
4.5 Code example:
Take the city JSON file from Lagou (http://www.lagou.com/lbs/getAllCitySearchLabels.json) as an example: get the list of all city names and write it to a file.

import requests
import jsonpath
import json

# Fetch the city data as JSON
url = 'http://www.lagou.com/lbs/getAllCitySearchLabels.json'
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
response = requests.get(url, headers=headers)
html_str = response.content.decode()

# Convert the JSON string into a Python object
jsonobj = json.loads(html_str)

# Starting from the root node, match every node whose key is name
citylist = jsonpath.jsonpath(jsonobj,'$..name')

# Write the result to a file
with open('city_name.txt', 'w', encoding='utf-8') as f:
    content = json.dumps(citylist, ensure_ascii=False)
    f.write(content)
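Because jsonpath returns False when nothing matches, the script above would write the literal text false to the file if the response had no name keys. One possible guard is sketched below. save_matches is a hypothetical helper, the city names are made-up sample values, and a temporary directory is used so the example runs without network access.

```python
import json
import os
import tempfile


def save_matches(matches, path):
    """Write jsonpath matches to `path` as JSON, treating False as 'no matches'."""
    names = matches if matches is not False else []
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(names, ensure_ascii=False))
    return names


out_path = os.path.join(tempfile.mkdtemp(), "city_name.txt")

saved_empty = save_matches(False, out_path)            # nothing matched
saved_names = save_matches(["北京", "上海"], out_path)   # made-up sample result

with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(saved_empty, reloaded)
```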

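Summary

JSON (JavaScript Object Notation) is a data format used when exchanging data.

The json module methods for converting between strings and Python types are dumps and loads.

The json module methods for converting between files and Python types are dump and load.

Install the jsonpath module with pip install jsonpath.

Root node in jsonpath: $

Child node in jsonpath: .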
