Fetching English Linguistics Paper Abstracts


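To prepare for an academic event, I needed abstract data from linguistics papers. I searched for quite a while for a good source, and the website below turned out to be a decent one.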

Site to crawl

lingbuzz - archive of linguistics articles

URL structure and description

Base url : https://ling.auf.net/

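Category page url : https://ling.auf.net/lingbuzz/_listing?community=<category>&start=<start index>

Four categories are crawled: semantics, syntax, phonology, and morphology.

Start index: the first page lists 30 papers, and every page after that lists 100, so start goes 1, 31, 131, 231, and so on.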

base_url = base + "lingbuzz/_listing?community=" + cate + "&start=" + str(start)
res = requests.get(base_url)
html = bs(res.text, 'html.parser')
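The start index rule is easy to get wrong by one page, so here is a minimal sketch of it as a generator. The name listing_starts and its two parameters are my own for illustration. The page sizes of 30 and 100 are simply the ones described above.

```python
from itertools import islice


def listing_starts(first_page_size=30, page_size=100):
    """Yield the `start` query value for each successive listing page."""
    start = 1
    yield start
    # Only the first page is short; every later page advances by page_size.
    start += first_page_size
    while True:
        yield start
        start += page_size


first_four = list(islice(listing_starts(), 4))
print(first_four)  # [1, 31, 131, 231]
```

The generator never stops by itself, so the caller still has to break out of the loop when a listing page comes back empty.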

Paper detail page: https://ling.auf.net/lingbuzz/<paper id>

base = 'https://ling.auf.net/'
detail_page = requests.get(base + detail_url)
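Plain string concatenation works, but it can produce a double slash when the href already starts with "/". The following is a small sketch that builds both kinds of URL with the standard library instead. The helper names listing_url and detail_url and the paper id 000123 are made up for illustration.

```python
from urllib.parse import urlencode, urljoin

BASE = "https://ling.auf.net/"


def listing_url(category, start):
    """Build a category listing URL with properly encoded query parameters."""
    query = urlencode({"community": category, "start": start})
    return urljoin(BASE, "lingbuzz/_listing") + "?" + query


def detail_url(href):
    """Resolve an href from the listing against the base URL."""
    # urljoin copes with hrefs both with and without a leading slash.
    return urljoin(BASE, href)


print(listing_url("Phonology", 31))
print(detail_url("/lingbuzz/000123"))  # 000123 is a hypothetical paper id
```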

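Category page

The table that holds the papers is tables[2], that is, index 2 in the list of table elements on the page. Inside it, the papers sit in the first nested table under table > tbody > tr > td.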

tables = html.select('table')
temp_table = tables[2]
paper_table = temp_table.select_one('table')

In that nested table, each tr holds one paper.

rows = paper_table.select('tr')

The last td in each tr contains the href of the paper's detail page.

for row in rows:
    detail_url = row.select('td')[-1].select_one("a")['href']
    detail_page = requests.get(base + detail_url)
    detail_page_html = bs(detail_page.text, 'html.parser')
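To make the table navigation easier to follow, here is a sketch that runs the same selectors against a tiny hand-written HTML string instead of the live site. The sample markup, its column layout, and the paper ids are invented for illustration and are only assumed to resemble the real listing page. The detail_links helper is hypothetical as well. It also skips rows that have no link, which the loop above does not do.

```python
from bs4 import BeautifulSoup

# Made-up markup: two unrelated tables, then a table wrapping the paper table.
SAMPLE_LISTING = """
<table><tr><td>header</td></tr></table>
<table><tr><td>navigation</td></tr></table>
<table><tr><td>
  <table>
    <tr><td>Author A</td><td><a href="/lingbuzz/000001">First paper</a></td></tr>
    <tr><td>Author B</td><td><a href="/lingbuzz/000002">Second paper</a></td></tr>
    <tr><td>row without a link</td></tr>
  </table>
</td></tr></table>
"""


def detail_links(html_text):
    """Return the detail-page hrefs found in the nested paper table."""
    soup = BeautifulSoup(html_text, "html.parser")
    tables = soup.select("table")
    if len(tables) < 3:
        return []
    # select_one searches descendants, so this finds the nested table.
    paper_table = tables[2].select_one("table")
    if paper_table is None:
        return []
    links = []
    for row in paper_table.select("tr"):
        cells = row.select("td")
        anchor = cells[-1].select_one("a") if cells else None
        if anchor is not None and anchor.has_attr("href"):
            links.append(anchor["href"])
    return links


links = detail_links(SAMPLE_LISTING)
print(links)
```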

Detail Page

Paper title: center > font > a tag

bs4 object.tag_to_remove.extract(): removes the tag from the tree and returns it

title = detail_page_html.select_one("center").font.extract().text

Authors: center > a tags

author_list = detail_page_html.select("center a")

# Extract the authors
author = ''
for person in author_list:
    author += person.text + ', '
author = author.strip(', ')

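Abstract: the text sits directly inside the body tag with no tag of its own, so all the other tags have to be removed

bs4 object.tag_to_remove.decompose(): removes the tag and returns nothing

# Remove tags so that only the abstract remains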
detail_page_html.center.decompose()
detail_page_html.title.decompose()
detail_page_html.table.decompose()
detail_page_html.table.decompose()
detail_page_html.p.decompose()

abstract = detail_page_html.text
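The difference between extract() and decompose() is easiest to see on a small document. The sketch below uses a hand-written HTML string that only imitates the layout described above. The markup, title, and author names are invented, and the real detail page may well contain more tags that need removing.

```python
from bs4 import BeautifulSoup

# Made-up detail page: title and authors in <center>, abstract as bare body text.
SAMPLE = """<html><head><title>sample</title></head><body>
<center><font><a href="#">A Sample Title</a></font><br/>
<a href="#">First Author</a>, <a href="#">Second Author</a></center>
<table><tr><td>metadata</td></tr></table>
This is the abstract text.
</body></html>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# extract() detaches the <font> tag and returns it, so its text is still usable.
title = soup.select_one("center").font.extract().text

# With the title gone, the only <a> tags left under <center> are the authors.
authors = ", ".join(a.text for a in soup.select("center a"))

# decompose() destroys each tag in place and returns None.
for tag in (soup.center, soup.title, soup.table):
    tag.decompose()

abstract = soup.get_text().strip()
print(title, "|", authors, "|", abstract)
```

Extracting the title first matters here: if the font tag were still inside center, the title link would show up in the author list too.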

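Miscellaneous

Some abstracts contain characters, such as French letters, that cannot be stored without a Unicode encoding, so I encoded the output with utf-8-sig (note that the full code below passes encoding='utf-16' to to_csv). Even so, a few special characters still did not come through cleanly. For example, curly single quotes (‘ ’) and curly double quotes (“ ”) are not characters people normally type, so I swap them for their plain ASCII equivalents with replace. The text sometimes also contains a JS snippet, which is removed with replace as well.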
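As a rough sketch of that cleanup, the replacements can be collected in one table and applied in a single pass with str.translate. The names REPLACEMENTS, JS_RESIDUE, and clean_abstract are mine. The characters and the JS snippet are the ones handled in the full code.

```python
# Typographic characters mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2014": "-",  # em dash
}
TABLE = str.maketrans(REPLACEMENTS)

# Inline script text that leaks into the page text.
JS_RESIDUE = "/*<![CDATA[*/function onLoad(){};/*]]>*/"


def clean_abstract(text):
    """Drop the JS residue, normalize quotes and dashes, trim whitespace."""
    return text.replace(JS_RESIDUE, "").translate(TABLE).strip()


raw = "  \u201cHello\u201d \u2014 it\u2019s fine" + JS_RESIDUE
print(clean_abstract(raw))
```

Keeping the mapping in a dict makes it easy to add another character later without growing a long chain of replace calls.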

Full code

import requests
from bs4 import BeautifulSoup as bs

def crawl(cate):
    total = []
    base = 'https://ling.auf.net/'
    # Example listing URL: https://ling.auf.net/lingbuzz/_listing?community=Phonology&start=31
    print(cate)
    print()
    start = 1
    while True:
        base_url = base + "lingbuzz/_listing?community=" + cate + "&start=" + str(start)
        res = requests.get(base_url)
        html = bs(res.text, 'html.parser')

        tables = html.select('table')
        temp_table = tables[2]
        paper_table = temp_table.select_one('table')
        if paper_table is None or len(paper_table.text.strip()) == 0:
            break

        rows = paper_table.select('tr')

        for row in rows:
            paper = {'category' : cate}
            detail_url = row.select('td')[-1].select_one("a")['href']
            detail_page = requests.get(base + detail_url)
            detail_page_html = bs(detail_page.text, 'html.parser')

            # Extract the title and remove it from the tree
            title = detail_page_html.select_one("center").font.extract().text
            print(cate, len(total), title)
            paper['title'] = title
            author_list = detail_page_html.select("center a")

            # Extract the authors
            author = ''
            for person in author_list:
                author += person.text + ', '
            author = author.strip(', ')
            paper['author'] = author
						
            # Remove tags so that only the abstract remains
            detail_page_html.center.decompose()
            detail_page_html.title.decompose()
            detail_page_html.table.decompose()
            detail_page_html.table.decompose()
            detail_page_html.p.decompose()

            # Normalize curly quotes and dashes; drop stray characters and the inline JS
            abstract = (detail_page_html.text
                        .replace('’', "'").replace('‘', "'")
                        .replace('“', '"').replace('”', '"')
                        .replace("—", "-")
                        .replace("혻혻", "")
                        .replace("/*<![CDATA[*/function onLoad(){};/*]]>*/", "")
                        .strip())
            paper['abstract'] = abstract
            paper['url'] = base + detail_url
            total.append(paper)
        # Advance the start index to the next page
        if start == 1:
            start += 30
        else:
            start += 100
    # Save as CSV
    import pandas as pd
    data = pd.DataFrame(total)
    data.to_csv(cate + ".csv", encoding='utf-16')

# Process the 4 topics with one thread each
import threading

category = ['phonology', 'semantics', 'syntax', 'morphology']

thread_count = len(category)
threads = []

# Create and start a new thread per category, then add it to the thread list
for i in range(thread_count):
    thread = threading.Thread(target=crawl, args=(category[i],))
    thread.start()
    threads.append(thread)

# The main thread waits until every worker thread has finished
for thread in threads:
    thread.join()
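An alternative to managing Thread objects by hand is concurrent.futures from the standard library. This is only a sketch of that option, not what the code above does. fake_crawl is a hypothetical stand-in for crawl() that returns a value without touching the network, so the snippet can run on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_crawl(category):
    """Hypothetical stand-in for crawl(): no network access, just a return value."""
    return category, len(category)


categories = ['phonology', 'semantics', 'syntax', 'morphology']

# The with-block waits for all workers to finish, much like calling join() on each thread.
with ThreadPoolExecutor(max_workers=len(categories)) as pool:
    # map() returns results in input order and re-raises worker exceptions here.
    results = list(pool.map(fake_crawl, categories))

print(results)
```

One practical difference is that an exception raised inside a worker surfaces in the main thread when the results are consumed, whereas a plain Thread only prints its traceback and the main thread carries on.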

Reference

More information on this topic (Fetching English Linguistics Paper Abstracts) can be found at https://velog.io/@kjh03160/English-Linguistics-논문-abstract-가져오기

This text may be freely shared or copied, provided the URL of this document is kept as a reference.
