Big Data Technology Study Notes, Website Traffic Log Analysis Project: Flume Log Collection System (1)


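1. Website log traffic project
-》Project development phases:
    -》Feasibility analysis
    -》Requirements analysis
    -》Detailed design
    -》Code implementation
    -》Testing
    -》Go-live
-》Big data business process
    -》Data collection: sqoop, Flume, kafka, Logstash
        -》Data sources: log files, RDBMS, real-time data streams
        -》Destinations: hdfs, nosql, Hive
    -》Data storage: the ingestion process
    -》Data computation: hive, MapReduce, spark
        -》Data cleaning
        -》Data modeling
        -》Data analysis
    -》Data presentation: java web, visualization and analysis tools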


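2. Flume basics
-》Features
       collecting, aggregating, and moving
       source,     channel,         sink
-》How Flume works
    -》source: reads the data source, turns its data into a data stream, and wraps it in events
        An event is the smallest unit of data collection and has two parts:
        header: holds configuration information as key=value pairs
        body: the actual data
    -》channel: temporarily stores the data
    -》sink: sends the data to the destination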
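
The source, channel, and sink roles above can be sketched as a minimal agent definition. This is an illustrative configuration only, not one of the case files used later in these notes: the netcat source and port 44444 are assumptions chosen for the example, while the names a1, s1, c1, and k1 follow the configuration shown at the end of the notes.

```properties
# Illustrative minimal agent: netcat source, memory channel, logger sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# source: each line received on the port becomes one event
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

# channel: buffers events in memory until the sink takes them
a1.channels.c1.type = memory

# sink: writes each event (headers and body) to the agent's log
a1.sinks.k1.type = logger

# wiring: a source may feed several channels, a sink drains exactly one
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: the source property is the plural `channels`, the sink property is the singular `channel`.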


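3. Deploying Flume
-》Download and extract
       tar -zxvf flume-ng-1.6.0-cdh5.7.6.tar.gz -C /opt/cdh-5.7.6/
-》Edit the configuration file: rename the template, then set JAVA_HOME in conf/flume-env.sh
       mv conf/flume-env.sh.template conf/flume-env.sh
       export JAVA_HOME=/opt/modules/jdk1.8.0_91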

-》How Flume locates HDFS
    -》Configure the global environment variable HADOOP_HOME
    -》Declare HADOOP_HOME in the configuration file
    -》Give the absolute HDFS address in the agent
               hdfs://hostname:8020/flume
-》If HDFS is configured for HA
    -》Copy core-site.xml and hdfs-site.xml into Flume's configuration directory
       cp ../hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml ../hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml conf/
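
As a sketch of the "absolute HDFS address in the agent" option, the sink can carry the full HDFS URI itself. The sink name k1 follows the later examples in these notes, and `hostname` is the placeholder used above. The HA nameservice name `ns1` is purely hypothetical; the real value would be whatever dfs.nameservices is set to in your hdfs-site.xml.

```properties
# Non-HA: address the NameNode directly
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hostname:8020/flume

# HA alternative: address the logical nameservice instead of a single host.
# "ns1" is a made-up name; it can only be resolved once core-site.xml and
# hdfs-site.xml are present in Flume's conf directory.
# a1.sinks.k1.hdfs.path = hdfs://ns1/flume
```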

-》Copy the jars that Flume needs for writing data to HDFS into Flume's lib directory
           commons-configuration-1.6.jar
           hadoop-auth-2.6.0-cdh5.7.6.jar
           hadoop-common-2.6.0-cdh5.7.6.jar
           hadoop-hdfs-2.6.0-cdh5.7.6.jar
           htrace-core4-4.0.1-incubating.jar


4. Using Flume
-》How to run Flume:
       flume-og: the old version
       flume-ng: the new version
       Usage: bin/flume-ng <command> [options]...
           bin/flume-ng agent --conf $flume_conf_dir --name agent_name --conf-file agent_file_path -Dflume.root.logger=INFO,console

-》Case 1: read the hive log and collect it to the logger
       agent:
           source: reads the hive log and sends the log data to the channel
           channel: holds the data sent by the source, in memory
           sink: takes the data from the channel and writes it to the log

    -》Run
           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-log.properties -Dflume.root.logger=INFO,console
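
These notes do not reproduce case/hive-mem-log.properties, so the following is only a guess at what such a file could look like: an exec source tailing the hive log, a memory channel, and a logger sink. The log path /path/to/hive/logs/hive.log is a placeholder, and the choice of an exec source with `tail -F` is an assumption; the channel sizes are copied from the configuration at the end of the notes.

```properties
# Sketch of a possible hive-mem-log.properties (not the author's actual file)
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# source: run tail and turn every new line of the hive log into an event
# (placeholder path, replace with the real hive log location)
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /path/to/hive/logs/hive.log

# channel: in-memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sink: print events through the agent's logger
a1.sinks.k1.type = logger

# bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

With `-Dflume.root.logger=INFO,console` on the command line, the logger sink's output appears directly in the terminal.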

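-》Case 2: using the file channel
       bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-file-log.properties -Dflume.root.logger=INFO,console

       mem: fast reads and writes, but buffered data is lost if the agent process stops
       file: comparatively slower, but safer because events are persisted to disk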
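
Assuming hive-file-log.properties differs from case 1 only in its channel, the change could be as small as the fragment below. Both directories are hypothetical paths chosen for illustration; if they are left out, the file channel falls back to directories under ~/.flume/file-channel.

```properties
# Replace the memory channel of case 1 with a file channel
a1.channels.c1.type = file
# hypothetical location of the channel's checkpoint
a1.channels.c1.checkpointDir = /opt/datas/flume/filechannel/checkpoint
# hypothetical location of the channel's data files
a1.channels.c1.dataDirs = /opt/datas/flume/filechannel/data
```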

-》Case 3: collecting data into hdfs
       bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-hdfs.properties -Dflume.root.logger=INFO,console
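
case/hive-mem-hdfs.properties is not shown in these notes. Assuming it keeps the source and channel of case 1, only the sink would change, roughly as in this sketch. The target directory /flume/hive-log is hypothetical.

```properties
# Sketch: HDFS sink in place of the logger sink
a1.sinks.k1.type = hdfs
# hypothetical target directory
a1.sinks.k1.hdfs.path = /flume/hive-log
# write the events as plain text instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.channel = c1
```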

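    -》Configuring the size of the generated files
        -》Roll files by time (0 disables time-based rolling)
               hdfs.rollInterval=0
        -》Roll files by size: the default is 1024 bytes
               hdfs.rollSize=10240 (in production, usually the byte count for about 125 MB)
        -》Roll files by number of events (0 disables count-based rolling)
               hdfs.rollCount=0

           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-size.properties -Dflume.root.logger=INFO,console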
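
To make the "about 125 MB" remark concrete: 125 × 1024 × 1024 = 131072000 bytes. A sink that rolls on size alone could then be configured as in this sketch; the exact threshold is an assumption and is normally chosen relative to the HDFS block size.

```properties
# Roll on size only: the time and event-count triggers are switched off with 0
a1.sinks.k1.hdfs.rollInterval = 0
# 125 * 1024 * 1024 bytes
a1.sinks.k1.hdfs.rollSize = 131072000
a1.sinks.k1.hdfs.rollCount = 0
```

If the other two triggers keep their non-zero defaults, whichever fires first closes the file, which typically produces many small files.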


    -》Generating a separate directory per time period
           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-part.properties -Dflume.root.logger=INFO,console
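
Time-based directories are usually obtained with escape sequences in hdfs.path. The fragment below is a sketch under assumptions: the directory layout /flume/part/date=.../hour=... is hypothetical, and the agent's local clock is used because no interceptor is adding a timestamp header.

```properties
a1.sinks.k1.type = hdfs
# %Y-%m-%d and %H are filled in from the event's timestamp
a1.sinks.k1.hdfs.path = /flume/part/date=%Y-%m-%d/hour=%H
# take the timestamp from the agent's clock instead of an event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```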

    -》Setting the file name prefix: hdfs.filePrefix
    -》Setting the idle timeout, after which an inactive file is closed: hdfs.idleTimeout
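
Both properties belong to the HDFS sink. The values below are illustrative only: the prefix hive-log and the 60-second timeout are assumptions, not settings taken from the case files.

```properties
# files are named hive-log.<number> instead of the default FlumeData.<number>
a1.sinks.k1.hdfs.filePrefix = hive-log
# close a file that has received no events for 60 seconds (0 = never close on idle)
a1.sinks.k1.hdfs.idleTimeout = 60
```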

-》Case 4:
       logs/2018-04-02.log
            2018-04-03.log
            2018-04-04.log

    -》spooling dir source: dynamically reads the files in a directory
       Run:
       bin/flume-ng agent --conf conf/ --name a1 --conf-file case/dir-mem-size.properties -Dflume.root.logger=INFO,console

       Files that are still being written carry a .tmp suffix and are skipped; once complete they are renamed and picked up:
       logs/2018-04-02.log.tmp -> 2018-04-02.log
            2018-04-03.log.tmp
            2018-04-04.log

-》Case 5:
       logs/2018-04-02.log
            2018-04-03.log
            2018-04-04.log

    -》Using the taildir source
        -》Older Flume versions do not include this source, so you have to compile the taildir source code yourself.
            -》Get the taildir source code from flume-1.7
            -》Import it into eclipse
            -》One class file is missing:
               C:\Users\江城子\Desktop\Git\flume\flume-ng-core\src\main\java\org\apache\flume\source\PollableSourceConstants.java
            -》Remove the two @Override annotations
            -》Build the jar with maven
            -》Put the jar into the lib directory
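
Once the jar is in lib, a taildir source could be declared roughly as follows. This is a sketch, not the author's file: the position file and log directory are hypothetical paths, and the fully qualified class name is used on the assumption that the alias TAILDIR is only registered in Flume 1.7 and later.

```properties
# Self-built jar on an older Flume: reference the source by class name
a1.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
# hypothetical file in which the read offset of every tailed file is recorded
a1.sources.s1.positionFile = /opt/datas/flume/taildir/position.json
# one file group, matching files in a hypothetical directory by regex
a1.sources.s1.filegroups = f1
a1.sources.s1.filegroups.f1 = /opt/datas/flume/logs/.*log
a1.sources.s1.channels = c1
```

Because offsets are kept in the position file, the source can resume where it stopped after a restart, which the exec source with tail cannot do.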


-》Commonly used Flume component types
       -》source: avro source/sink, kafka source, exec source, spooldir source, taildir source
       -》channel: file, mem, kafka
       -》sink: kafka, hdfs, hive

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'a1'

# define agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume/spooling
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

# define channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# define sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/spoolingdir
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollCount = 0

# bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
