BUPT Data Mining and Data Warehousing -- Text Classification Experiment (Part 1)


Experiment requirements: collect text in ten categories, 100,000 articles per category (1,000,000 in total), and classify it with Naive Bayes or an SVM.
  • Data collection (web crawler)
  • Word segmentation with pynlpir, the Python wrapper for the Chinese Academy of Sciences NLPIR segmenter
  • tf-idf computation for each word with sklearn
  • Text classification with Naive Bayes (a minimal sketch of steps 2-4 appears right after this list)
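
For orientation, here is a minimal end-to-end sketch of how steps 2-4 fit together once the text has been crawled. It assumes pynlpir and scikit-learn are installed; docs and labels are hypothetical placeholders for the crawled corpus, and the real code for these steps comes in the later installments:

    # coding:utf-8
    # Sketch only: 'docs' and 'labels' stand in for the crawled articles and their categories.
    import pynlpir
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split

    docs = [...]    # one article body per entry
    labels = [...]  # the matching category names: 'game', 'sports', ...

    pynlpir.open()  # initialise the NLPIR segmenter
    # join the segmented words with spaces so sklearn's default tokenizer can split them again
    segmented = [' '.join(pynlpir.segment(d, pos_tagging=False)) for d in docs]
    pynlpir.close()

    X = TfidfVectorizer().fit_transform(segmented)   # word counts -> tf-idf matrix
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    clf = MultinomialNB().fit(X_train, y_train)      # Naive Bayes over tf-idf features
    print(clf.score(X_test, y_test))                 # accuracy on the held-out 20%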
  • Data collection (crawler). Because news data is easy to collect, we scraped ten categories of articles from news portals such as Sina and China News Service: military, automobile, finance, education, games, health, IT, sports, entertainment, and fashion. We crawled somewhat more than 100,000 articles per category, since our home-grown scrapy crawler is not very efficient and some articles carry so little body text that it is unclear whether anything will survive segmentation. The crawler code follows. ------items.py defines one item class per news category; each scraped record holds the title, the URL, and the body text.
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    # Every category shares the same fields; SportsItem and EconomyItem simply
    # omit the running article number 'No' that the other categories carry.
    class SportsItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class EconomyItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class PoliItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class CultureItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class EduItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class ArmyItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class SciItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class TrendItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class GameItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
    class YuleItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
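
Since all ten item classes carry the same fields, an equivalent and shorter design would be a single shared item with a category field. A sketch, reusing the scrapy import above (NewsItem and its category values are hypothetical names, not part of the original project):

    class NewsItem(scrapy.Item):
        No = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        content = scrapy.Field()
        category = scrapy.Field()  # e.g. 'sports', 'economy', ... instead of one class per category

The pipeline below would then branch on item['category'] instead of a chain of isinstance checks.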
    

    ---------pipelines.py writes each category to its own TXT file and numbers each article.
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    from DataSet.items import SportsItem, EconomyItem, PoliItem, CultureItem, EduItem, ArmyItem, SciItem, TrendItem, GameItem, YuleItem

    # Map each item class to its output file.  Records are appended, separated
    # by a blank line; classes that carry a 'No' field get a running number.
    FILES = {
        SportsItem: '/home/hya/DataSet/sports.txt',
        EconomyItem: '/home/hya/DataSet/economy.txt',
        PoliItem: 'D:/poli.txt',
        CultureItem: 'D:/culture.txt',  # the original wrote culture items to poli.txt, evidently a copy-paste slip
        EduItem: 'D:/edu.txt',
        ArmyItem: 'D:/army.txt',
        SciItem: 'D:/sci.txt',
        TrendItem: 'D:/trend.txt',
        GameItem: 'D:/Data/dataset/game.txt',
        YuleItem: 'D:/Data/dataset/yule.txt',
    }


    class DatasetPipeline(object):
        def process_item(self, item, spider):
            with open(FILES[type(item)], 'a') as fp:
                if 'No' in item:  # SportsItem/EconomyItem carry no number
                    fp.write(str(item['No']) + '\n')
                fp.write(item['title'].encode('utf-8') + '\n'
                         + item['link'].encode('utf-8') + '\n'
                         + item['content'].encode('utf-8') + '\n\n')
            return item  # every branch returns the item so Scrapy keeps processing it
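
Nothing here runs unless Scrapy is told about the pipeline. As the generated header comment notes, it must be registered under ITEM_PIPELINES in the project's settings.py; assuming the project package is DataSet (which matches the from DataSet.items import in game.py below), the entry would be:

    ITEM_PIPELINES = {
        'DataSet.pipelines.DatasetPipeline': 300,  # 300 is an arbitrary order value in 0-1000
    }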

    ------game.py is the spider that crawls the game category.
    # coding:utf-8
    
    import re
    import scrapy
    from scrapy.http import Request
    from scrapy.selector import Selector
    from DataSet.items import GameItem
    import time
    
    count = 0  # global running number assigned to each crawled article


    class GameSpider(scrapy.spiders.Spider):
        name = "game"
        # other list pages that were considered:
        # http://roll.mil.news.sina.com.cn/col/gjjq/index.shtml
        # http://www.diyiyou.com/news/gnxw/index_2863.html
        s = "https://www.app178.com/xinwen_"
        m = ".html"
        # the news list is paginated as xinwen_1.html .. xinwen_1385.html, so all
        # list pages are enumerated up front rather than followed page by page
        start_urls = ["https://www.app178.com/xinwen_1.html", ]
        for i in range(2, 1386):
            start_urls.append(s + str(i) + m)
    
        def parse(self, response):
            selector = Selector(response)
            # every list page carries its article links and titles under div.list_left
            links = selector.xpath('//*[@class="list_left"]/ul/li/div/a/@href').extract()
            titles = selector.xpath('//*[@class="list_left"]/ul/li/div/a/text()').extract()
            for i in range(len(links)):
                h = "https://www.app178.com"
                link = h + links[i].strip()  # hrefs on the page are site-relative
                title = titles[i]
                yield Request(link, meta={'title': title, 'link': link},
                              callback=self.parse_content)  # fetch and parse the article body
    
        def parse_content(self, response):
            global count
            item = GameItem()
            item["link"] = response.meta['link']
            item["title"] = response.meta['title']
            sel = Selector(response)
            # the article body paragraphs sit under div.jjzq_ny_left1_main
            content = sel.xpath('//*[@class="jjzq_ny_left1_main"]/p/text()').extract()
            if len(content) != 0:  # skip pages with no extractable body text
                item['content'] = ''.join(content)
                count = count + 1
                item['No'] = count  # running article number
                return item
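
    The spider is started with the standard command "scrapy crawl game" (Scrapy looks the spider up by its name attribute), after which every article page with non-empty body text is numbered and appended to game.txt by the pipeline. Enumerating all 1385 list pages in start_urls up front, instead of following the site's pagination links, keeps the spider simple but hard-codes the page count, which has to be updated if the site grows.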