
                A Simple Data Processing Example

                Date: January 29, 2015  Author: zhjw  Source: Internet  Views: 716

                  1. The Pig data model

                    Bag: a table

                    Tuple: a row (record)

                    Field: an attribute (column)

                    Pig does not require the tuples in a bag to have the same number of fields, or fields of the same types.
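
To make the last point concrete, here is a small Python sketch that models a bag as a list and each tuple as a Python tuple. It is only an analogy, not Pig itself, and the values are made up for illustration.

```python
# A bag modelled as a plain list; each element stands for one Pig tuple.
# The values are invented for this illustration.
bag = [
    (20130001, 80, 90),    # three int fields
    (20130002, "absent"),  # two fields, one of them a string
    (20130003,),           # a single field
]

# Record how many fields each tuple has and what type each field is.
field_counts = [len(t) for t in bag]
field_types = [tuple(type(f).__name__ for f in t) for t in bag]

print(field_counts)
print(field_types)
```

The three tuples sit in the same container even though their shapes differ, which is the flexibility the data model allows.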

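                  2. Common Pig Latin statements

                    1) LOAD: specifies how the data is loaded

                    2) FOREACH: scans the data row by row and applies some processing to each row

                    3) FILTER: filters rows

                    4) DUMP: prints the result to the screen

                    5) STORE: saves the result to a file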

                  3. A simple example:

                    Suppose we have a score sheet containing a student ID, a Chinese score, and a math score, with fields separated by |, as follows:

                20130001|80|90
                20130002|85|96
                20130003|60|70
                20130004|74|86
                20130005|65|98

                  1) Upload the file from the local file system to Hadoop

                [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in

                  Check that the upload succeeded:

                [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
                Found 1 items
                -rw-r--r--   2 coder supergroup         75 2013-04-20 14:33 /user/coder/in/score.txt

                  2) Load the raw data with LOAD

                grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

                  The input file is 'hdfs://h1:9000/user/coder/in/score.txt'.

                  The relation (bag) is named scores.

                  Fields within each input record (tuple) are separated by |.

                  Each tuple read has 3 fields: student ID (num), Chinese score (Chinese), and math score (Math). All three are of type int.
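
As a rough analogue of what this LOAD statement describes, the Python sketch below splits each line on | and casts the three fields to int. It does not reflect how Pig is implemented; the helper name `load_scores` is hypothetical, and the sample lines are the ones from score.txt shown earlier.

```python
# Sample lines copied from score.txt.
lines = [
    "20130001|80|90",
    "20130002|85|96",
    "20130003|60|70",
    "20130004|74|86",
    "20130005|65|98",
]

def load_scores(rows, delimiter="|"):
    """Hypothetical helper: split each row on the delimiter and cast the
    three fields to int, mirroring AS (num:int, Chinese:int, Math:int)."""
    bag = []
    for row in rows:
        num, chinese, math_score = row.split(delimiter)
        bag.append((int(num), int(chinese), int(math_score)))
    return bag

scores = load_scores(lines)
for t in scores:
    print(t)
```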

                  3) Inspect the schema of the relation

                grunt> DESCRIBE scores;
                scores: {num: int,Chinese: int,Math: int}

                  4) Suppose we need to filter out the record whose student ID is 20130005

                grunt> filter_scores = FILTER scores BY num != 20130005;

                  View the filtered records:

                grunt> dump filter_scores;
                (20130001,80,90)
                (20130002,85,96)
                (20130003,60,70)
                (20130004,74,86)
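
The same filter can be sketched in Python as a list comprehension over the tuples. This is only an analogy for the FILTER ... BY condition, using the data shown above.

```python
# The scores bag as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# Keep every tuple whose first field (num) is not 20130005.
filter_scores = [t for t in scores if t[0] != 20130005]

for t in filter_scores:
    print(t)
```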

                  5) Compute each student's total score

                grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;

                  View the result:

                grunt> dump totalScore;

                 

                (20130001,170)
                (20130002,181)
                (20130003,130)
                (20130004,160)
                (20130005,163)
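
For comparison, a Python analogue of FOREACH ... GENERATE builds one output tuple per input tuple. Note that the statement above reads from scores, not from filter_scores, which is why 20130005 still appears in the output.

```python
# The unfiltered scores bag as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# One output tuple per input tuple: the ID and the sum of the two scores.
total_score = [(num, chinese + math_score) for num, chinese, math_score in scores]

for t in total_score:
    print(t)
```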

                  

                  6) Write each student's total score to a file

                grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');

                  View the result:

                [coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
                Found 2 items
                drwxr-xr-x   - coder supergroup          0 2013-04-20 15:54 /user/coder/out/result/_logs
                -rw-r--r--   2 coder supergroup         65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
                [coder@h1 ~]$ ^C
                [coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
                20130001|170
                20130002|181
                20130003|130
                20130004|160
                20130005|163
                cat: Source must be a file.
                [coder@h1 ~]$ 
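
The stored layout can be reproduced with a short Python sketch: join the fields of each tuple with | and end each record with a newline. Assuming one newline per record, the text comes to 65 bytes, which matches the size listed for part-m-00000.

```python
# The totals computed in step 5, as (num, total) tuples.
total_score = [
    (20130001, 170),
    (20130002, 181),
    (20130003, 130),
    (20130004, 160),
    (20130005, 163),
]

# Join the fields of each tuple with '|' and terminate each record with '\n'.
output = "".join("|".join(str(f) for f in t) + "\n" for t in total_score)

print(output, end="")
print(len(output.encode("utf-8")), "bytes")
```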

                 


                  Here is another small example:

                  Suppose there is a batch of files in the following format:

                zhangsan#123456#zhangsan@qq.com
                lisi#434dfdds#lisi@126.com
                wangwu#ffere233#wangwu@163.com
                zhouliu#fgrtr43#zhouliu@139.com

                  Each record has three fields: account name, password, and email address, separated by #. The goal is to extract the email addresses from these files.

                  

                  1) Upload the file to Hadoop

                [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in

                 

                [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
                Found 1 items
                -rw-r--r--   2 coder supergroup        122 2013-04-24 20:34 /user/coder/in/data.txt
                [coder@h1 hadoop-0.20.2]$ 

                  2) Load the raw data file

                grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);

                  3) Extract the email field

                grunt> T_B = FOREACH T_A GENERATE email;

                  4) Write the result to a file

                grunt> STORE T_B INTO '/user/coder/out/email';

                  5) View the result

                [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
                zhangsan@qq.com
                lisi@126.com
                wangwu@163.com
                zhouliu@139.com
                cat: Source must be a file.
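
The whole second example amounts to splitting each line on # and keeping the third field. A minimal Python sketch of that projection, using the sample records above (the variable names `t_a` and `t_b` simply mirror the Pig aliases):

```python
# Sample records copied from data.txt.
lines = [
    "zhangsan#123456#zhangsan@qq.com",
    "lisi#434dfdds#lisi@126.com",
    "wangwu#ffere233#wangwu@163.com",
    "zhouliu#fgrtr43#zhouliu@139.com",
]

# LOAD analogue: split each line into (username, password, email).
t_a = [tuple(line.split("#")) for line in lines]

# FOREACH ... GENERATE email analogue: keep only the third field,
# still wrapped in a one-field tuple.
t_b = [(email,) for _username, _password, email in t_a]

for (email,) in t_b:
    print(email)
```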

                 
