sqoop2: Exporting data from MySQL to HDFS
sqoop-shell
Start the sqoop2-shell:
jjzhu:bin didi$ sqoop2-shell
Setting conf dir: /opt/sqoop-1.99.7/bin/../conf
Sqoop home directory: /opt/sqoop-1.99.7
Sqoop Shell: Type 'help' or '\h' for help.
Configure the sqoop server
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
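Optionally, the server setting can be persisted so it does not have to be retyped every session; a small sketch, assuming the shell reads ~/.sqoop2rc at startup (check your version's documentation):
echo "set server --host localhost --port 12000 --webapp sqoop" >> ~/.sqoop2rc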
Check that the server connection is working:
sqoop:000> show version --all
client version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
server version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
API versions:
[v1]
sqoop:000>
Create links
List the connectors available on the sqoop server:
sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
| Name | Version | Class | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO |
| kite-connector | 1.99.7 | org.apache.sqoop.connector.kite.KiteConnector | FROM/TO |
| oracle-jdbc-connector | 1.99.7 | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO |
| ftp-connector | 1.99.7 | org.apache.sqoop.connector.ftp.FtpConnector | TO |
| hdfs-connector | 1.99.7 | org.apache.sqoop.connector.hdfs.HdfsConnector | FROM/TO |
| kafka-connector | 1.99.7 | org.apache.sqoop.connector.kafka.KafkaConnector | TO |
| sftp-connector | 1.99.7 | org.apache.sqoop.connector.sftp.SftpConnector | TO |
+------------------------+---------+------------------------------------------------------------+----------------------+
sqoop:000>
- generic-jdbc-connector: a connector built on Java JDBC; it can act as either the source (FROM) or the target (TO) of a transfer
- hdfs-connector: a connector that uses HDFS as the source or the target (the full list of inputs each connector accepts can be dumped as sketched below)
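Before creating links, it can help to see exactly which inputs each connector will prompt for; a minimal sketch using the shell itself:
sqoop:000> show connector --all    # prints every connector together with its link- and job-level configuration inputs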
Create a generic-jdbc-connector link with the following command:
sqoop:002> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql_weibouser_link
Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://127.0.0.1:3306/spider
Username: root
Password: ****
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry# protocol=tcp
There are currently 1 values in the map:
protocol = tcp
entry#
SQL Dialect
Identifier enclose: **Note: do not just press Enter here; type a single space character! Otherwise Sqoop wraps the table name in double quotes when querying MySQL, and the query fails.**
New link was successfully created with validation status OK and name mysql_weibouser_link
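The link above assumes the spider database is reachable with those credentials and that the MySQL JDBC driver (com.mysql.jdbc.Driver) is on the Sqoop server's classpath; a quick pre-check sketch (host, port, and credentials are the ones entered above):
mysql -h 127.0.0.1 -P 3306 -u root -p -e "SHOW TABLES IN spider;"   # the table to be exported should appear here
# if link validation complains about the driver class, copy the MySQL connector jar into the
# Sqoop server's classpath (the exact directory depends on how the server was installed) and restart it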
Create the HDFS link
sqoop:002> create link -c hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs_weibouser_link
HDFS cluster
URI: hdfs://localhost:9000
Conf directory: /opt/hadoop-2.7.3/etc/hadoop
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name hdfs_weibouser_link
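The URI entered for the HDFS link should match fs.defaultFS of the target cluster; a quick sanity check from the same machine (assumes the hadoop client uses the same configuration directory):
hdfs getconf -confKey fs.defaultFS    # expect hdfs://localhost:9000
hdfs dfs -ls /                        # confirms the NameNode is reachable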
List the links:
sqoop:002> show link
+----------------------+------------------------+---------+
| Name | Connector Name | Enabled |
+----------------------+------------------------+---------+
| mysql_weibouser | generic-jdbc-connector | true |
| mysql_weibouser_link | generic-jdbc-connector | true |
| hdfs_link | hdfs-connector | true |
| hdfs_link2 | hdfs-connector | true |
| hdfs_weibouser_link | hdfs-connector | true |
+----------------------+------------------------+---------+
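Existing links can be inspected, changed, or removed by name; a short sketch (assuming the name-based -n option of the 1.99.7 shell):
sqoop:000> show link -n mysql_weibouser_link     # dump the stored configuration of one link
sqoop:000> update link -n mysql_weibouser_link   # walk through the same prompts again to change values
sqoop:000> delete link -n hdfs_link2             # drop a link that is no longer needed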
Create a job
sqoop:002> create job -f "mysql_weibouser_link" -t "hdfs_weibouser_link"
Creating job for links with from name mysql_weibouser_link and to name hdfs_weibouser_link
Please fill following values to create new job object
Name: job_weibouser
Database source
Schema name: spider
Table name: spiders_weibouser
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column:
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose: 0
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0
Custom codec:
Output directory: hdfs://localhost:9000/usr/jjzhu/spider/spiders_weibouser
Append mode:
Throttling resources
Extractors: 2
Loaders: 2
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name job_weibouser
The meaning of each parameter:
Name: an identifier for the job; pick whatever you like.
Schema Name: the database/schema name. In MySQL a schema is essentially the same thing as a database; I haven't dug into the exact difference, but the official docs treat them much the same here.
Table Name: the table to export.
SQL Statement: an SQL query to use instead of a table. The documentation says it must contain a ${CONDITIONS} placeholder (apparently a condition clause), but I never managed to create a job this way.
After these, the element# prompt appears again, asking for list/map values; just press Enter to skip:
Partition column:
Partition column nullable:
Boundary query:
Last value:
After that come the target-side settings:
Null value: roughly, what value to write in place of nulls.
File format: the format of the data files in HDFS; TEXT_FILE, plain text, is used here.
Compression codec: the compression algorithm applied to the exported files; NONE here. CUSTOM lets you plug in your own codec by implementing the corresponding Java interface.
Custom codec: the custom codec class; since NONE was chosen, just press Enter.
Output directory: the HDFS path where the exported files are stored. It seems the job only succeeds when the directory is either new or exists but is empty, so prefer an empty path.
Append mode: whether new data should be appended when export files already exist under the output directory.
Extractors: 2 (roughly, the number of extract/map tasks)
Loaders: 2 (roughly, the number of load/reduce tasks)
Finally the element# prompt appears once more, for the "Extra mapper jars" property; it can be left empty, just press Enter.
If the shell reports the job was successfully created at this point, you are done.
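The stored job can be reviewed or adjusted later without recreating it; a short sketch (again assuming the name-based -n option of the shell):
sqoop:000> show job -n job_weibouser     # dump the job's configuration
sqoop:000> update job -n job_weibouser   # re-run the prompts to change individual values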
List the created jobs:
sqoop:002> show job
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
| Id | Name | From Connector | To Connector | Enabled |
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
| 1 | spider_job | mysql_weibouser (generic-jdbc-connector) | hdfs_link (hdfs-connector) | true |
| 2 | job_weibouser | mysql_weibouser_link (generic-jdbc-connector) | hdfs_weibouser_link (hdfs-connector) | true |
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
sqoop:002>
Start the job:
start job -n job_weibouser
sqoop:002> start job -n job_weibouser
Submission details
Job Name: job_weibouser
Server URL: https://localhost:12000/sqoop/
Created by: didi
Creation date: 2017-04-11 14:37:46 CST
Lastly updated by: didi
External ID: job_1491888730134_0003
https://jjzhu:8088/proxy/application_1491888730134_0003/
2017-04-11 14:37:46 CST: BOOTING - Progress is not available
Check the job's running status:
sqoop:002> status job -n job_weibouser
Submission details
Job Name: job_weibouser
Server URL: https://localhost:12000/sqoop/
Created by: didi
Creation date: 2017-04-11 14:37:46 CST
Lastly updated by: didi
External ID: job_1491888730134_0003
https://jjzhu:8088/proxy/application_1491888730134_0003/
2017-04-11 14:38:41 CST: SUCCEEDED
Counters:
org.apache.hadoop.mapreduce.FileSystemCounter
FILE_LARGE_READ_OPS: 0
FILE_WRITE_OPS: 0
HDFS_READ_OPS: 2
HDFS_BYTES_READ: 290
HDFS_LARGE_READ_OPS: 0
FILE_READ_OPS: 0
FILE_BYTES_WRITTEN: 51361466
FILE_BYTES_READ: 25115854
HDFS_WRITE_OPS: 2
HDFS_BYTES_WRITTEN: 24652721
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN: 0
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ: 0
org.apache.hadoop.mapreduce.JobCounter
TOTAL_LAUNCHED_MAPS: 2
VCORES_MILLIS_REDUCES: 20225
MB_MILLIS_MAPS: 27120640
TOTAL_LAUNCHED_REDUCES: 2
SLOTS_MILLIS_REDUCES: 20225
VCORES_MILLIS_MAPS: 26485
MB_MILLIS_REDUCES: 20710400
SLOTS_MILLIS_MAPS: 26485
MILLIS_REDUCES: 20225
OTHER_LOCAL_MAPS: 2
MILLIS_MAPS: 26485
org.apache.sqoop.submission.counter.SqoopCounters
ROWS_READ: 109408
ROWS_WRITTEN: 109408
org.apache.hadoop.mapreduce.TaskCounter
MAP_OUTPUT_MATERIALIZED_BYTES: 25115866
REDUCE_INPUT_RECORDS: 109408
SPILLED_RECORDS: 218816
MERGED_MAP_OUTPUTS: 4
VIRTUAL_MEMORY_BYTES: 0
MAP_INPUT_RECORDS: 0
SPLIT_RAW_BYTES: 290
FAILED_SHUFFLE: 0
MAP_OUTPUT_BYTES: 24762129
REDUCE_SHUFFLE_BYTES: 25115866
PHYSICAL_MEMORY_BYTES: 0
GC_TIME_MILLIS: 1648
REDUCE_INPUT_GROUPS: 109408
COMBINE_OUTPUT_RECORDS: 0
SHUFFLED_MAPS: 4
REDUCE_OUTPUT_RECORDS: 109408
MAP_OUTPUT_RECORDS: 109408
COMBINE_INPUT_RECORDS: 0
CPU_MILLISECONDS: 0
COMMITTED_HEAP_BYTES: 1951399936
Shuffle Errors
CONNECTION: 0
WRONG_LENGTH: 0
BAD_ID: 0
WRONG_MAP: 0
WRONG_REDUCE: 0
IO_ERROR: 0
Job executed successfully
Check the relevant HDFS path to see whether the output files were written:
jjzhu:~ didi$ hdfs dfs -ls /usr/jjzhu/spider
Found 4 items
drwxr-xr-x - didi supergroup 0 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser
drwxr-xr-x - 777 supergroup 0 2017-04-11 10:58 /usr/jjzhu/spider/weibouser
drwxr-xr-x - 777 supergroup 0 2017-04-11 13:33 /usr/jjzhu/spider/weobouser
drwxr-xr-x - didi supergroup 0 2017-04-11 13:39 /usr/jjzhu/spider/weobouser2
jjzhu:~ didi$ hdfs dfs -ls /usr/jjzhu/spider/spiders_weibouser
Found 2 items
-rw-r--r-- 1 didi supergroup 12262783 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser/33b56441-b638-48cc-8d0d-37a808f25653.txt
-rw-r--r-- 1 didi supergroup 12389938 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser/73b20d50-de72-4aea-8c8c-d97cdc48e667.txt
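To spot-check the exported rows rather than just the file sizes, the files can be read directly; a sketch (the file name comes from the listing above and will differ per run):
hdfs dfs -cat /usr/jjzhu/spider/spiders_weibouser/33b56441-b638-48cc-8d0d-37a808f25653.txt | head -n 5
hdfs dfs -du -h /usr/jjzhu/spider/spiders_weibouser    # total size of the export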
You can also check the result through the HDFS web UI: https://localhost:50070/
Notes and troubleshooting
When something goes wrong, always check the logs first; Sqoop writes sqoop.log to the path configured in sqoop.properties.
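A quick way to locate and follow that log (sketch; the conf path matches the install used above, the log directory is whatever sqoop.properties configures):
grep -i log /opt/sqoop-1.99.7/conf/sqoop.properties   # shows the configured log file/directory
tail -f <configured-log-dir>/sqoop.log                # follow it while re-running the failing command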
- When configuring the HDFS output path, make sure the user running the job has write permission there. Permissions can be opened up with the following command (or change the owner instead with hdfs dfs -chown -R <user> HDFS_PATH):
hdfs dfs -chmod -R 777 HDFS_PATH
- RemoteException: User: xxx is not allowed to impersonate xxx
This is a proxy-user problem: make sure the user xxx is declared as a proxy user in Hadoop's core-site.xml, then reload the settings as sketched below:
<!-- replace XXX with your own user -->
<property>
<name>hadoop.proxyuser.XXX.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.XXX.groups</name>
<value>*</value>
</property>
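After editing core-site.xml, the proxy-user settings still have to be picked up by the cluster; restarting works, or they can usually be reloaded in place (sketch):
hdfs dfsadmin -refreshSuperUserGroupsConfiguration   # reload proxy-user settings on the NameNode
yarn rmadmin -refreshSuperUserGroupsConfiguration    # reload them on the ResourceManager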
- Enable verbose exception output in the sqoop shell:
sqoop:000> set option --name verbose --value true
- status job -n jobname fails with: java.net.ConnectException: Call From xxx.xxx.xxx.xxx to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused
This usually appears on Hadoop 2.x: the nodes need to reach the JobHistory Server, and if it has not been configured the address defaults to 0.0.0.0:10020. Set it in mapred-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<!-- set the actual master hostname and port -->
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<!-- set the actual master hostname and port -->
<value>localhost:19888</value>
</property>
Start the JobHistory server (then verify it as sketched below):
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
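To confirm the history server actually came up before retrying the job (sketch):
jps | grep JobHistoryServer    # the process should be listed
netstat -an | grep 10020       # port 10020 should now be listening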
- Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143.
Make sure the MapReduce tasks get enough Java heap; this can be configured in mapred-site.xml (see also the container-size note after the snippet):
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1024m</value>
</property>
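If the container itself is killed (exit code 143) rather than the JVM throwing OutOfMemoryError, the YARN container sizes may also need to be raised so they sit above the heap values; a hedged sketch with assumed values:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1536</value>
</property>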
- Exception has occurred during processing command Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0016 - Can't fetch schema
Solution: when creating the MySQL link, Identifier enclose specifies the delimiter placed around SQL identifiers. Some databases quote identifiers with double quotes, e.g. select * from "table_name"; that syntax is an error in MySQL. The default value of this property is the double quote, so override it with a space (see the sketch below).
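If the link was already created with the default double quote, it can be fixed in place instead of being recreated; a sketch using the link name from the example above:
sqoop:000> update link -n mysql_weibouser_link
(step through the prompts again and type a single space at "Identifier enclose:")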
Reference: https://www.cnblogs.com/avivaye/p/6197123.html
Last updated: 2017-04-11 15:30:20