A Simple Data Processing Example

Date: January 29, 2015  Author: zhjw  Source: Internet  Views: 716

  1. Pig data model

    Bag: a table

    Tuple: a row (record)

    Field: an attribute (column)

    Pig does not require the tuples in a bag to have the same number of fields, or fields of the same type.
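The data model can be pictured with ordinary Python containers. This is only an analogy, not how Pig represents data internally: a bag is modeled as a list of tuples, and the sample values `"absent"` and `"note"` are made up for illustration.

```python
# A bag modeled as a list of tuples; each tuple is one record.
# Pig bags are unordered collections; a list is used here only for readable printing.
bag = [
    (20130001, 80, 90),          # three int fields
    (20130002, "absent"),        # two fields, the second one a string (hypothetical value)
    (20130003, 60, 70, "note"),  # four fields of mixed types (hypothetical value)
]

# Field counts and field types may differ from tuple to tuple.
field_counts = [len(t) for t in bag]
field_types = [tuple(type(f).__name__ for f in t) for t in bag]

print(field_counts)
print(field_types)
```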

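  2. Common Pig Latin statements

    1) LOAD: specifies how to read the input data

    2) FOREACH: applies a transformation to each row

    3) FILTER: keeps only the rows that satisfy a condition

    4) DUMP: prints the result to the screen

    5) STORE: saves the result to a file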

  3. A simple example

    Suppose we have a score sheet with a student number, a Chinese score, and a math score on each line, with the fields separated by |:

20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98

  1) Upload the file from the local file system to HDFS

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in

  Check that the upload succeeded:

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r--   2 coder supergroup         75 2013-04-20 14:33 /user/coder/in/score.txt

  2) Load the raw data with LOAD

grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

  The input file is 'hdfs://h1:9000/user/coder/in/score.txt'.

  The relation (bag) is named scores.

  Each line of the input file is split on | to form one tuple.

  Each tuple has three fields: the student number (num), the Chinese score (Chinese), and the math score (Math). All three are of type int.
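As a rough mental model of what this LOAD statement produces, the sketch below parses the same five lines in plain Python. The helper names `to_int` and `load` are hypothetical, and mapping an unparseable or missing field to `None` (standing in for a Pig null) reflects an assumption about Pig's behavior rather than something shown in this article.

```python
from typing import List, Optional, Tuple

RAW = """20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98
"""

Row = Tuple[Optional[int], Optional[int], Optional[int]]


def to_int(text: str) -> Optional[int]:
    """Cast one field to int; None stands in for a Pig null (assumed behavior)."""
    try:
        return int(text)
    except ValueError:
        return None


def load(raw: str, delimiter: str = "|") -> List[Row]:
    """Split each line on the delimiter and cast the first three fields to int."""
    rows: List[Row] = []
    for line in raw.splitlines():
        parts = line.split(delimiter)
        parts += [""] * (3 - len(parts))  # pad short lines so missing fields become None
        num, chinese, math_score = (to_int(p) for p in parts[:3])
        rows.append((num, chinese, math_score))
    return rows


scores = load(RAW)
for row in scores:
    print(row)
```

Declaring the types in the AS clause is what lets later steps compare num with a number and add Chinese to Math arithmetically.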

  3) Inspect the schema of the relation

grunt> DESCRIBE scores;
scores: {num: int,Chinese: int,Math: int}

  4) Suppose we want to filter out the record whose student number is 20130005

grunt> filter_scores = FILTER scores BY num != 20130005;

  View the filtered records:

grunt> dump filter_scores;
(20130001,80,90)
(20130002,85,96)
(20130003,60,70)
(20130004,74,86)
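The FILTER step keeps the rows for which the condition is true and drops the rest. A minimal Python sketch of the same selection, using a hypothetical helper `pig_filter` and the score data from above:

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int, int]

scores: List[Row] = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]


def pig_filter(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Return only the rows for which the predicate is true."""
    return [row for row in rows if keep(row)]


# Equivalent of: FILTER scores BY num != 20130005
filter_scores = pig_filter(scores, lambda row: row[0] != 20130005)

for row in filter_scores:
    print(row)
```

Note that the original relation is left untouched: scores still holds all five records, which is why the total-score step that follows still includes student 20130005.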

  5) Compute each student's total score

grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;

  View the result:

grunt> dump totalScore;

(20130001,170)
(20130002,181)
(20130003,130)
(20130004,160)
(20130005,163)
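FOREACH ... GENERATE builds one output tuple per input tuple. In Python terms this is a plain mapping over the rows; the sketch below reproduces the five totals shown above and is meant only as an illustration of the computation.

```python
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# Equivalent of: FOREACH scores GENERATE num, Chinese+Math
total_score = [(num, chinese + math_score) for (num, chinese, math_score) in scores]

for row in total_score:
    print(row)
```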

  6) Write each student's total score to a file

grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');

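  View the result. The output path is a directory, and the glob in the cat command also matches its _logs subdirectory, which is why cat ends with "Source must be a file.":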

[coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
Found 2 items
drwxr-xr-x   - coder supergroup          0 2013-04-20 15:54 /user/coder/out/result/_logs
-rw-r--r--   2 coder supergroup         65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
[coder@h1 ~]$ ^C
[coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
20130001|170
20130002|181
20130003|130
20130004|160
20130005|163
cat: Source must be a file.
[coder@h1 ~]$
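The STORE step can be pictured as writing one delimited line per tuple into a part file inside an output directory. The sketch below imitates that on the local file system with a hypothetical `store` helper; it writes to a temporary directory rather than HDFS, and the single file name `part-m-00000` is copied from the listing above.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple

total_score = [
    (20130001, 170),
    (20130002, 181),
    (20130003, 130),
    (20130004, 160),
    (20130005, 163),
]


def store(rows: Iterable[Tuple], out_dir: Path, delimiter: str = "|") -> Path:
    """Write rows to a single part file inside out_dir, one delimited line per row."""
    out_dir.mkdir(parents=True, exist_ok=False)  # fail if the output directory already exists
    part = out_dir / "part-m-00000"
    with part.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(delimiter.join(str(field) for field in row) + "\n")
    return part


with tempfile.TemporaryDirectory() as tmp:
    part_file = store(total_score, Path(tmp) / "result")
    stored_text = part_file.read_text(encoding="utf-8")

print(stored_text, end="")
```

Five lines of 12 characters plus a newline each come to 65 bytes, which agrees with the size reported for part-m-00000 in the directory listing.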

  Here is another small example.

  Suppose we have a batch of files in the following format:

zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com

  Each record has three fields (account name, password, and email address) separated by #. The goal is to extract the email addresses from these files.

  1) Upload the file to HDFS

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r--   2 coder supergroup        122 2013-04-24 20:34 /user/coder/in/data.txt
[coder@h1 hadoop-0.20.2]$

  2) Load the raw data file

grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);

  3) Project the email field

grunt> T_B = FOREACH T_A GENERATE email;

  4) Write the result to a file

grunt> STORE T_B INTO '/user/coder/out/email';

  5) View the result

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
zhangsan@qq.com
lisi@126.com
wangwu@163.com
zhouliu@139.com
cat: Source must be a file.
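The whole second example (load with a # delimiter, project one field, store) can be summarized in a few lines of Python. This is a sketch of the data flow only. The STORE statement here names no storage function, so Pig falls back to its default tab-delimited PigStorage output; with a single field per tuple no delimiter appears in the result.

```python
RAW = """zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com
"""

# Equivalent of LOAD ... USING PigStorage('#') with three chararray fields.
t_a = [tuple(line.split("#")) for line in RAW.splitlines()]

# Equivalent of: FOREACH T_A GENERATE email (each result is a one-field tuple).
t_b = [(email,) for (_username, _password, email) in t_a]

# Equivalent of STORE with the default tab delimiter.
output = "\n".join("\t".join(row) for row in t_b)
print(output)
```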
