TIL | More Python & Building a Web Scraper #1


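Python

- A very beautiful programming language! Easy for beginners to understand.
- Where Java is centered on the web, Python is used more widely in fields such as data science and machine learning.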

Theory

Python rules

A string is written inside quotation marks, e.g. a = "like this"

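Snake Case
: When a variable name is made of several words, separate the words with an underscore, e.g. a_string, fav_food
(all lowercase, with one underscore between each word)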

False, True, etc. are written without quotation marks, e.g. c = False (the first letter must be a capital)

💡 With c = "false", the value is treated as the string "false", not as a boolean.
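The difference between the boolean False and the string "false" is easy to check by hand. A minimal sketch (the variable names c and s are only for illustration):

```python
# A boolean literal and a string that merely looks like one
c = False
s = "false"

print(type(c))  # <class 'bool'>
print(type(s))  # <class 'str'>

# Any non-empty string is truthy, so "false" counts as True in a condition
print(bool(s))  # True
print(c == s)   # False
```

This is why a quoted "false" inside an if condition does not behave like False.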

Variable types

a_string = "Like this"
a_number = 3
a_float = 3.12
a_boolean = False
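a_none = None  # the name None is specific to Python
float: a number with a decimal point; boolean: true or false
Variable: a place where information is stored (the name on the left of the equals sign)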

Terminology

int (integer)

bool (boolean)

str (string)
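These short names are the ones Python itself reports. A small sketch, reusing the variables from the variable types list above, that asks each value for its type:

```python
a_string = "Like this"
a_number = 3
a_float = 3.12
a_boolean = False
a_none = None

# type() returns the class of a value; __name__ is its short name
values = (a_string, a_number, a_float, a_boolean, a_none)
type_names = [type(value).__name__ for value in values]
print(type_names)  # ['str', 'int', 'float', 'bool', 'NoneType']
```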

Sequence type

: A type that keeps its items in order, like a list; list is one of the sequence types.

list []
: Written with square brackets; items are separated by commas, and strings go inside ""
ex. days = ["Mon","Tue","Wed","Thur","Fri","Sat"]

tuple ()
: immutable → a sequence that cannot be changed after it is created
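The practical difference between a list and a tuple shows up when trying to change them. A minimal sketch (the names days_list and days_tuple are made up for this example):

```python
days_list = ["Mon", "Tue", "Wed"]
days_tuple = ("Mon", "Tue", "Wed")

# A list can be changed in place
days_list.append("Thu")
days_list[0] = "Sun"
print(days_list)  # ['Sun', 'Tue', 'Wed', 'Thu']

# A tuple cannot: item assignment raises TypeError
try:
    days_tuple[0] = "Sun"
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("tuple is immutable:", error)

print(days_tuple)  # ('Mon', 'Tue', 'Wed')
```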

dictionary {}

🤔 Dictionary example

nico = {"name": "Nico","age": 29,"korean": True,
"fav_food": ["Kimchi", "Sashimi"]}

print(nico["fav_food"])
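Beyond reading a value by key, a dictionary can also be updated. The sketch below extends the nico example; the keys "handsome" and "job" are made up here for illustration and are not part of the original notes:

```python
nico = {"name": "Nico", "age": 29, "korean": True,
        "fav_food": ["Kimchi", "Sashimi"]}

# Read a value by key
print(nico["name"])  # Nico

# Add a new key, or overwrite an existing one
nico["handsome"] = True
nico["age"] = 30

# A value can itself be a list, and that list stays mutable
nico["fav_food"].append("Pizza")
print(nico["fav_food"])  # ['Kimchi', 'Sashimi', 'Pizza']

# .get() returns a fallback instead of raising KeyError for a missing key
print(nico.get("job", "unknown"))  # unknown
```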

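Functions

A function is defined rather than created. A function definition starts with def (short for define).

When defining a function, its body must be indented with a tab or spaces. A line without indentation is not part of the function body.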

Example

print(len("lalsmfkdfljslfjsdlkkfjlsdf"))
: prints the length of the string

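🤔 Default value example

def plus(a, b=0):
 print(a + b)

# Parameters with a default value must come after those without one,
# so the default goes on b, not on a
def minus(a, b=0):
 print(a - b)

plus(2)
minus(2)

→ When no argument is passed for a parameter, its default value is used.

🤔 Example using return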

def plus(a, b):
   return a + b

result = plus(2,4)
print(result)

Note

Statements placed after return inside a function are not executed; the function ends as soon as it returns.
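A short sketch of this behavior. The helper first_negative is a made-up name for illustration; it shows that return also ends a loop early:

```python
def plus(a, b):
    return a + b
    print("this line never runs")  # unreachable: the function already returned

def first_negative(numbers):
    # return leaves the function at once, so later items are never visited
    for number in numbers:
        if number < 0:
            return number
    return None  # reached only when no negative number was found

print(plus(2, 4))                   # 6
print(first_negative([3, -1, -7]))  # -1
print(first_negative([1, 2]))       # None
```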

🤔 Example: putting variables into a string (f-string)

def say_hello(name, age):
 return f"Hello {name} you are {age} years old"

hello = say_hello("nico", "12")
print(hello)

Ways to pass arguments

Keyword argument (the better way 💜)
: Values are paired with parameters by name, ex. b=30, a=1

positional argument
: Values are assigned to parameters in order (by position)
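The two calling styles can be compared side by side. A minimal sketch using a minus function that returns its result:

```python
def minus(a, b):
    return a - b

# Positional: values are matched to parameters by order
print(minus(30, 1))      # 29

# Keyword: values are matched by name, so the order does not matter
print(minus(b=30, a=1))  # -29

# Mixing is allowed, but positional arguments must come first
print(minus(30, b=1))    # 29
```

Keyword arguments make the call site readable without looking up the parameter order, which is presumably why the notes prefer them.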

<Reference link>
https://docs.python.org/3/library/index.html

if statements

<Example 1>

def plus(a, b):
	if type(b) is str:
		return None
	else:
		return a + b

# is -> object identity
# is not -> negated object identity

def plus(a, b):
	if type(b) is int or type(b) is float:
		return a + b
	else:
		return None

print(plus(12, 1.2))
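A common alternative to comparing type(b) with is would be isinstance, which accepts a tuple of types. The sketch below is one possible variant, not the version from the course; it also checks a, which the original example does not:

```python
def plus(a, b):
    # isinstance accepts a tuple of types and also matches subclasses
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    return None

print(plus(12, 1.2))   # 13.2
print(plus(12, "10"))  # None
```

One difference to keep in mind: bool is a subclass of int, so isinstance(True, int) is True, while type(True) is int is False.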

for statements

- Can be used with a string, tuple, list, or any other iterable object

days = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri')

for day in days:
	if day == 'Wed':
		break
	else:
		print(day)
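The same loop works on other iterables as well. A small sketch: a string yields its characters and a dictionary yields its keys. continue is added here for contrast with break; the list printed is a made-up name:

```python
# A string is iterable: the loop visits one character at a time
letters = []
for letter in "Wed":
    letters.append(letter)
print(letters)  # ['W', 'e', 'd']

# A dictionary is iterable too: the loop visits its keys
nico = {"name": "Nico", "age": 29}
for key in nico:
    print(key, nico[key])

# break leaves the loop at once; continue skips to the next item
days = ("Mon", "Tue", "Wed", "Thu", "Fri")
printed = []
for day in days:
    if day == "Tue":
        continue
    if day == "Thu":
        break
    printed.append(day)
print(printed)  # ['Mon', 'Wed']
```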

Building a web scraper

requests package

: A package that bundles the functionality for making HTTP requests in Python
the power of Requests

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

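beautifulsoup4

- A very useful library for extracting information from HTML. It makes it easy to find the data you need and to group similar data together.
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (see the official documentation)

1. Fetching the HTML with the requests package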

import requests

# Request the HTML page with the requests library -> the response is stored in indeed_resul
indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

print(indeed_resul)
print(indeed_resul.text)  # Example 1: when you want the entire HTML text

<Response [200]>   # means OK
# the entire HTML text is printed

2. Extracting the pagination with Beautiful Soup

- First, inspect the HTML of the page

import requests
from bs4 import BeautifulSoup    # import the BeautifulSoup library

indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

# Parse the HTML with the beautifulsoup4 library
# soup = BeautifulSoup(html_doc, 'html.parser')
indeed_soup = BeautifulSoup(indeed_resul.text, "html.parser")

# After parsing the HTML, get the ul tag
# The find method searches for a tag -> it returns a single tag
pagination = indeed_soup.find("ul", {"class":"pagination-list"})

# find_all returns every tag that matches the condition, as a list
links = pagination.find_all('a')

pages = []

# Find the span that is a child of each a tag
# links is a list, so loop over it with for
for link in links:
  pages.append(link.find("span"))
pages = pages[:-1]

print(pages)

[<span class="pn">2</span>, <span class="pn">3</span>, <span class="pn">4</span>, <span class="pn">5</span>]

3. Finding the last page number

import requests
from bs4 import BeautifulSoup

indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

indeed_soup = BeautifulSoup(indeed_resul.text, "html.parser")

pagination = indeed_soup.find("ul", {"class":"pagination-list"})

links = pagination.find_all('a')

pages = []

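# Example 1: read the number from the span inside each a tag
for link in links[:-1]:
  pages.append(int(link.find("span").string))

# Example 2: read the number straight from the a tag
pages = []  # reset so the same values are not appended twice
for link in links[:-1]:
  pages.append(int(link.string))

# Examples 1 and 2 give the same values -> use the simpler example 2

# Find the last page value
max_page = pages[-1]

print(pages)
print(max_page)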

[2, 3, 4, 5]
5
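The parsing step can be tried without calling the live site. The sketch below runs the same find / find_all logic on a hand-written HTML string; SAMPLE_HTML only imitates the pagination markup, and the function name extract_last_page and the fallback to page 1 when no pagination is found are assumptions made for this example:

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the pagination markup (an assumption)
SAMPLE_HTML = (
    '<ul class="pagination-list">'
    '<li><b>1</b></li>'
    '<li><a href="/jobs?start=10"><span class="pn">2</span></a></li>'
    '<li><a href="/jobs?start=20"><span class="pn">3</span></a></li>'
    '<li><a href="/jobs?start=10" aria-label="Next"><span class="np">Next</span></a></li>'
    '</ul>'
)

def extract_last_page(html):
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("ul", {"class": "pagination-list"})
    if pagination is None:
        # No pagination block: treat the result as a single page (assumption)
        return 1
    links = pagination.find_all("a")
    # Drop the final "Next" link, then convert the remaining labels to int
    pages = [int(link.string) for link in links[:-1]]
    return pages[-1] if pages else 1

print(extract_last_page(SAMPLE_HTML))          # 3
print(extract_last_page("<p>no results</p>"))  # 1
```

link.string works here because each a tag has a single span child, so Beautiful Soup passes the child's string up to the parent. Checking for None first avoids an AttributeError when find does not match anything.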

4. Requesting each page, and splitting the code into modules

from indeed import extract_indeed_pages, extract_indeed_jobs

last_indeed_page = extract_indeed_pages()

extract_indeed_jobs(last_indeed_page)

import requests
from bs4 import BeautifulSoup

LIMIT = 50 
URL = f"https://kr.indeed.com/%EC%B7%A8%EC%97%85?as_and=python&as_phr&as_any&as_not&as_ttl&as_cmp&jt=all&st&salary&radius=25&l=%EC%84%9C%EC%9A%B8&fromage=any&limit={LIMIT}"

def extract_indeed_pages():
  resul = requests.get(URL)
  soup = BeautifulSoup(resul.text, "html.parser")
  pagination = soup.find("ul", {"class":"pagination-list"})

  links = pagination.find_all('a')
  pages = []
  for link in links[:-1]:
    pages.append(int(link.string))
  
  max_page = pages[-1]
  return max_page

def extract_indeed_jobs(last_page):
  for page in range(last_page):
    result = requests.get(f"{URL}&start={page * LIMIT}")
    print(result.status_code)
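The start parameter is simply the page index multiplied by LIMIT. A network-free sketch of that arithmetic follows; BASE_URL is a placeholder for illustration rather than the real search URL, and build_page_urls is a hypothetical helper name:

```python
LIMIT = 50
# Placeholder base URL for illustration, not the real search URL
BASE_URL = "https://example.com/jobs?q=python"

def build_page_urls(last_page, limit=LIMIT):
    # Page 0 starts at result 0, page 1 at result `limit`, and so on
    return [
        f"{BASE_URL}&limit={limit}&start={page * limit}"
        for page in range(last_page)
    ]

urls = build_page_urls(3)
for url in urls:
    print(url)
```

Building the list of URLs first makes it possible to check the offsets (0, 50, 100) before sending any request.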

Reference

About this post (TIL | More Python & Building a Web Scraper #1): the original can be found at https://velog.io/@sihaha/TIL-파이썬-추가-웹스크래퍼-만들기-1

This text may be freely shared or copied. However, please keep the URL of this document as a reference URL.

Collection and Share based on the CC Protocol
