Alibaba Cloud
Technical Community [Yunqi]
DataX Configuration and Usage
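1. DataX 3.0 Overview
DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable, efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, and others), HDFS, Hive, ODPS, HBase, and FTP.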

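DataX 1.0 was open-sourced earlier. This article covers DataX 3.0, the new version open-sourced by Alibaba, which offers more powerful features and a better user experience. GitHub home page: https://github.com/alibaba/DataX
2. DataX 3.0 Framework Design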

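As an offline data synchronization framework, DataX is built on a Framework + plugin architecture. Reading from and writing to a data source are abstracted into Reader and Writer plugins that plug into the overall synchronization framework.
- Reader: the data collection module. It reads data from the source and sends it to the Framework.
- Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.
- Framework: connects the Reader and the Writer and serves as the data transfer channel between them. It handles the core technical concerns: buffering, flow control, concurrency, and data conversion.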
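The Reader, Writer, and Framework roles described above can be illustrated with a small sketch. This is not DataX code: the names `reader`, `writer`, and `run_sync` are hypothetical, and a bounded `queue.Queue` stands in for the Framework's channel, on the assumption that a full buffer is what throttles the Reader.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def reader(source, channel):
    """Reader role: collect records from the source and hand them to the channel."""
    for record in source:
        channel.put(record)  # blocks while the channel is full (simple flow control)
    channel.put(_DONE)


def writer(channel, sink):
    """Writer role: keep pulling records from the channel and write them to the sink."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        sink.append(record)


def run_sync(source, capacity=2):
    """Framework role: connect one reader and one writer through a bounded buffer."""
    channel = queue.Queue(maxsize=capacity)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(source, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


if __name__ == "__main__":
    print(run_sync([{"id": 1}, {"id": 2}, {"id": 3}]))
```

The small `capacity` is deliberate: it forces the reader to wait for the writer, which is the simplest form of the buffering and flow control the Framework is said to handle.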
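3. DataX 3.0 Plugin System
After several years of development, DataX has a fairly complete plugin system. Mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. DataX currently supports the following data sources: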
| Type | Data source | Reader (read) | Writer (write) |
|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | SqlServer | √ | √ |
| | PostgreSQL | √ | √ |
| | Dameng (達夢) | √ | √ |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | √ | √ |
| NoSQL data storage | OTS | √ | √ |
| | HBase 0.94 | √ | √ |
| | HBase 1.1 | √ | √ |
| | MongoDB | √ | √ |
| Unstructured data storage | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
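The DataX Framework exposes a simple interface for interacting with plugins and a simple mechanism for adding them: once any one plugin is added, it can exchange data seamlessly with all the other data sources. For details, see the DataX data source guide.
4. DataX 3.0 Core Architecture
The open-source release of DataX 3.0 runs synchronization jobs in single-machine, multi-threaded mode. Following the sequence diagram of a DataX job's life cycle, this section gives a very brief overview of how the DataX modules relate to one another.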

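Core modules:
- A single data synchronization run in DataX is called a Job. When DataX receives a Job, it starts one process to carry out the whole synchronization. The DataX Job module is the central management node for that job and is responsible for data cleanup, subtask splitting (turning a single job into multiple Tasks), TaskGroup management, and similar functions.
- Once started, the DataX Job splits itself into multiple small Tasks (subtasks) according to the splitting strategy of the source, so that they can run concurrently. A Task is the smallest unit of a DataX job, and each Task synchronizes one portion of the data.
- After splitting, the DataX Job calls the Scheduler module, which regroups the Tasks into TaskGroups according to the configured concurrency. Each TaskGroup runs all of its assigned Tasks at a fixed concurrency. By default, the concurrency of a single TaskGroup is 5.
- Every Task is started by its TaskGroup. Once started, a Task always launches a Reader -> Channel -> Writer chain of threads to do the synchronization work.
- While the job is running, the Job monitors the TaskGroups and waits for them to finish. When all TaskGroups have completed, the Job exits successfully. Otherwise it exits abnormally, and the process exit value is non-zero.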
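Because a failed job is reported through a non-zero process exit value, a wrapper script can decide success or failure from the return code alone. The following is a minimal sketch, assuming the job is launched as an ordinary child process; `run_job` is a hypothetical helper, and the demo uses a stand-in command rather than a real DataX invocation.

```python
import subprocess
import sys


def run_job(command):
    """Run a job command and return (succeeded, exit_code)."""
    completed = subprocess.run(command)
    return completed.returncode == 0, completed.returncode


if __name__ == "__main__":
    # Stand-in command that exits with status 0. A real call might pass
    # something like ["python", "datax.py", "../job/myor.json"]; those
    # paths are assumptions about the local installation.
    ok, code = run_job([sys.executable, "-c", "import sys; sys.exit(0)"])
    print(ok, code)
```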
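DataX scheduling flow:
As an example, suppose a user submits a DataX job configured with a concurrency of 20, with the goal of synchronizing a MySQL source sharded into 100 tables to ODPS. DataX makes its scheduling decisions as follows:
- The DataX Job splits the work into 100 Tasks, one per sharded table.
- With a concurrency of 20 and a default of 5 per TaskGroup, DataX calculates that 4 TaskGroups are needed.
- The 4 TaskGroups divide the 100 Tasks evenly: each TaskGroup runs 25 Tasks at a concurrency of 5.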
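The arithmetic in this example can be reproduced in a few lines. The sketch below is only an illustration of the numbers above: `plan_task_groups` is a hypothetical function, and the round-robin assignment is an assumption, since the text only says the Tasks are divided evenly.

```python
import math


def plan_task_groups(num_tasks, concurrency, per_group=5):
    """Return one list of task ids per TaskGroup.

    per_group is the concurrency of a single TaskGroup (5 by default,
    as stated in the text). Round-robin assignment is an assumption.
    """
    if num_tasks < 0 or concurrency < 1 or per_group < 1:
        raise ValueError("num_tasks must be >= 0; concurrency and per_group >= 1")
    num_groups = math.ceil(concurrency / per_group)
    groups = [[] for _ in range(num_groups)]
    for task_id in range(num_tasks):
        groups[task_id % num_groups].append(task_id)
    return groups


if __name__ == "__main__":
    groups = plan_task_groups(num_tasks=100, concurrency=20)
    print(len(groups), [len(g) for g in groups])  # 4 [25, 25, 25, 25]
```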
5. Installing DataX
(1) Download the DataX source code:
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# git clone git@github.com:alibaba/DataX.git
(2) Build the package with Maven:
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# cd {DataX_source_code_home}
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# yum install -y maven
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# mvn -U clean package assembly:assembly -Dmaven.test.skip=true
When the build succeeds, the log ends as follows:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
After a successful build, the DataX package is located in {DataX_source_code_home}/target/datax/datax/ and has the following structure:
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# cd {DataX_source_code_home}
[root@iZ2zeh44pi6rlahxj7s9azZ ~]# ls ./target/datax/datax/
bin conf job lib log log_perf
6. Transferring MySQL Data to Oracle
(1) Generate a configuration template for MySQL to Oracle:
[root@iZ2zeh44pi6rlahxj7s9azZ bin]# python datax.py -r mysqlreader -w oraclewriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2016, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the oraclewriter document:
https://github.com/alibaba/DataX/blob/master/oraclewriter/doc/oraclewriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "oraclewriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "username": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
(2) Fill in the job configuration file:
[root@iZ2zeh44pi6rlahxj7s9azZ bin]# vim ../job/myor.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["id","qiye","diqu"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://[HOST_NAME]:PORT/[DATABASE_NAME]"],
                                "table": ["***"]
                            }
                        ],
                        "password": "***",
                        "username": "***",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "oraclewriter",
                    "parameter": {
                        "column": ["id","qiye","diqu"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]",
                                "table": ["***"]
                            }
                        ],
                        "password": "***",
                        "preSql": ["delete from ceshi"],
                        "username": "***"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
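When several jobs share the same shape, the configuration can be generated instead of edited by hand. The sketch below builds the same structure as the file above from Python dictionaries. `build_job` and the keys of its `reader` and `writer` arguments are hypothetical names chosen for this example; only the JSON keys it emits are taken from the listing. Note that `jdbcUrl` is a list on the reader side and a plain string on the writer side, as in the listing.

```python
import json


def build_job(columns, reader, writer, pre_sql=(), channel=1):
    """Build a mysqlreader -> oraclewriter job dictionary.

    reader and writer are dicts with the (hypothetical) keys
    jdbc_url, table, username, and password.
    """
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                            "column": list(columns),
                            "connection": [
                                {
                                    "jdbcUrl": [reader["jdbc_url"]],  # list
                                    "table": [reader["table"]],
                                }
                            ],
                            "password": reader["password"],
                            "username": reader["username"],
                            "where": "",
                        },
                    },
                    "writer": {
                        "name": "oraclewriter",
                        "parameter": {
                            "column": list(columns),
                            "connection": [
                                {
                                    "jdbcUrl": writer["jdbc_url"],  # string
                                    "table": [writer["table"]],
                                }
                            ],
                            "password": writer["password"],
                            "preSql": list(pre_sql),
                            "username": writer["username"],
                        },
                    },
                }
            ],
            # The listing stores channel as a string, so this mirrors it.
            "setting": {"speed": {"channel": str(channel)}},
        }
    }


if __name__ == "__main__":
    # Placeholder values only; replace them with real connection details.
    job = build_job(
        columns=["id", "qiye", "diqu"],
        reader={"jdbc_url": "jdbc:mysql://[HOST_NAME]:PORT/[DATABASE_NAME]",
                "table": "***", "username": "***", "password": "***"},
        writer={"jdbc_url": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]",
                "table": "***", "username": "***", "password": "***"},
        pre_sql=["delete from ceshi"],
    )
    print(json.dumps(job, indent=4))
```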
(3) Run the transfer:
[root@iZ2zeh44pi6rlahxj7s9azZ bin]# python datax.py ../job/myor.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2016, Alibaba Group. All Rights Reserved.
2017-09-25 20:09:01.200 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2017-09-25 20:09:01.215 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.144-b01
jvmInfo: Linux amd64 3.10.0-514.6.2.el7.x86_64
cpu num: 1
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
.
.
.
2017-09-25 20:19:12.557 [job-0] INFO StandAloneJobContainerCommunicator - Total 1419776 records, 31164761 bytes | Speed 54.10KB/s, 2457 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 597.972s | All Task WaitReaderTime 1.983s | Percentage 0.00%
2017-09-25 20:19:32.558 [job-0] INFO StandAloneJobContainerCommunicator - Total 1464832 records, 32195809 bytes | Speed 50.34KB/s, 2252 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 617.847s | All Task WaitReaderTime 2.031s | Percentage 0.00%
2017-09-25 20:19:42.559 [job-0] INFO StandAloneJobContainerCommunicator - Total 1491456 records, 32744862 bytes | Speed 53.62KB/s, 2662 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 628.044s | All Task WaitReaderTime 2.054s | Percentage 0.00%
2017-09-25 20:19:52.561 [job-0] INFO StandAloneJobContainerCommunicator - Total 1516032 records, 33271617 bytes | Speed 51.44KB/s, 2457 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 637.787s | All Task WaitReaderTime 2.076s | Percentage 0.00%
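The progress lines printed by StandAloneJobContainerCommunicator follow a fixed layout, so they are easy to extract for monitoring. The sketch below parses one such line with a regular expression written against the log lines shown above. `parse_progress` and the keys of the returned dictionary are hypothetical, and the pattern is an assumption based only on these samples; other DataX versions may format the line differently.

```python
import re

# Pattern derived from the sample log lines above. Speed units other
# than KB are an assumption.
PROGRESS = re.compile(
    r"Total (?P<records>\d+) records, (?P<bytes>\d+) bytes \| "
    r"Speed (?P<speed>[\d.]+)(?P<unit>[KMG]?B)/s, (?P<rps>\d+) records/s \| "
    r"Error (?P<err_records>\d+) records, (?P<err_bytes>\d+) bytes \| "
    r"All Task WaitWriterTime (?P<wait_writer>[\d.]+)s \| "
    r"All Task WaitReaderTime (?P<wait_reader>[\d.]+)s"
)


def parse_progress(line):
    """Return a dict of progress fields, or None if the line does not match."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    return {
        "records": int(m.group("records")),
        "bytes": int(m.group("bytes")),
        "speed": float(m.group("speed")),
        "speed_unit": m.group("unit") + "/s",
        "records_per_s": int(m.group("rps")),
        "error_records": int(m.group("err_records")),
        "error_bytes": int(m.group("err_bytes")),
        "wait_writer_s": float(m.group("wait_writer")),
        "wait_reader_s": float(m.group("wait_reader")),
    }


if __name__ == "__main__":
    sample = (
        "2017-09-25 20:19:12.557 [job-0] INFO StandAloneJobContainerCommunicator - "
        "Total 1419776 records, 31164761 bytes | Speed 54.10KB/s, 2457 records/s | "
        "Error 0 records, 0 bytes | All Task WaitWriterTime 597.972s | "
        "All Task WaitReaderTime 1.983s | Percentage 0.00%"
    )
    print(parse_progress(sample))
```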
(4) Log in to Oracle to verify that the transfer succeeded
Before the transfer:
[oracle@iz2zec57gfl6i9vbtdksl1z ~]$ sqlplus ***/***
SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 25 20:11:44 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> select * from ceshi;
no rows selected
After the transfer:
[oracle@iz2zec57gfl6i9vbtdksl1z ~]$ sqlplus ***/***
SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 25 20:15:28 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> select count(*) from ceshi;
COUNT(*)
----------
917504
Last updated: 2017-09-25 21:03:41