利用yarn capacity scheduler在EMR集群上實現大集群的多租戶的集群資源quota限製與管控

背景

使用過hadoop的人基本都會考慮集群裏麵資源的調度和優先級的問題，假設你現在所在的公司有一個大hadoop的集群，有很多不同的業務組同時使用。但是A項目組經常做一些定時的BI報表，B項目組則經常使用一些軟件做一些臨時需求。那麼他們肯定會遇到同時提交任務的場景，這個時候到底如何分配資源滿足這兩個任務呢？是先執行A的任務，再執行B的任務，還是同時跑兩個？

目前一些使用EMR的大公司，會使用一個比較大的集群，來共公司內部不同業務組的人共同使用，使用大的集群一方麵比使用多個小集群，可以達到的性能極限更高，同時可以根據不同業務的峰穀情況可是獲得更高的資源利用裏，降低總成本，另外一個方麵，能夠更好地在不同的業務組之間實現資源的共享和數據的流動。下麵結合EMR集群，介紹一下如何進行大集群的資源quota管控。目前還需要使用EMR的作業調度的sdk來實現用戶的控製，後續EMR會將改功能產品化實現在EMR產品控製台上。

yarn默認提供了兩種調度規則，capacity scheduler和fair scheduler。現在使用比較多的是capacity scheduler。具體的實現原理和調度源碼可以google一下capacity scheduler。

什麼是capacity調度器

Capacity調度器說的通俗點，可以理解成一個個的資源隊列。這個資源隊列是用戶自己去分配的。比如我大體上把整個集群分成了AB兩個隊列，A隊列給A項目組的人來使用。B隊列給B項目組來使用。但是A項目組下麵又有兩個方向，那麼還可以繼續分，比如專門做BI的和做實時分析的。那麼隊列的分配就可以參考下麵的樹形結構

root
------default[20%]
------q1[60%]
      |---q1.q11[70%]
      |---q1.q12[30%]
------q2[20%]

整個集群的queue必須掛在root下麵。分成三個queue：default，q1和q2。三個隊列使用集群資源的quota配比為：20%，60%，20%。default這個queue是必須要存在的。

雖然有了這樣的資源分配，但是並不是說提交任務到q2裏麵，它就隻能使用20%的資源，即使剩下的80%都空閑著。它也是能夠實現（通過配置），隻要資源實在空閑狀態，那麼q2就可以使用100%的資源。但是一旦其它隊列提交了任務，q2就需要在釋放資源後，把資源還給其它隊列，知道大道預設計配比值。粗粒度上資源是按照上麵的方式進行，在每個隊列的內部，還是按照FIFO的原則來分配資源的。

capacity調度器特性

capacity調度器具有以下的幾個特性：

層次化的隊列設計，這種層次化的隊列設計保證了子隊列可以使用父隊列設置的全部資源。這樣通過層次化的管理，更容易合理分配和限製資源的使用。容量保證，隊列上都會設置一個資源的占比，這樣可以保證每個隊列都不會占用整個集群的資源。
安全，每個隊列又嚴格的訪問控製。用戶隻能向自己的隊列裏麵提交任務，而且不能修改或者訪問其他隊列的任務。
彈性分配，空閑的資源可以被分配給任何隊列。當多個隊列出現爭用的時候，則會按照比例進行平衡。
多租戶租用，通過隊列的容量限製，多個用戶就可以共享同一個集群，同事保證每個隊列分配到自己的容量，提高利用率。
操作性，yarn支持動態修改調整容量、權限等的分配，可以在運行時直接修改。還提供給管理員界麵，來顯示當前的隊列狀況。管理員可以在運行時，添加一個隊列；但是不能刪除一個隊列。管理員還可以在運行時暫停某個隊列，這樣可以保證當前的隊列在執行過程中，集群不會接收其他的任務。如果一個隊列被設置成了stopped，那麼就不能向他或者子隊列上提交任務了。

capacity調度器的配置

登錄EMR集群master節點，編輯配置文件：
/etc/emr/hadoop-conf/capacity-scheduler.xml
這裏先貼出來完整的比較複雜一點的配置，後麵再詳細說明：

        <?xml version="1.0" encoding="utf-8"?>

        <configuration> 
          <property> 
            <name>yarn.scheduler.capacity.maximum-applications</name>  
            <value>10000</value>  
            <description>Maximum number of applications that can be pending and running.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>  
            <value>0.25</value>  
            <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.resource-calculator</name>  
            <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>  
            <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.queues</name>  
            <value>default,q1,q2</value>  
            <description>The queues at the this level (root is the root queue).</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.queues</name>  
            <value>q11,q12</value>  
            <description>The queues at the this level (root is the root queue).</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.default.capacity</name>  
            <value>20</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.capacity</name>  
            <value>60</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.capacity</name>  
            <value>20</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.q11.capacity</name>  
            <value>70</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.q11.maximum-capacity</name>  
            <value>90</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.q11.minimum-user-limit-percent</name>  
            <value>25</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.q12.capacity</name>  
            <value>30</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.q12.user-limit-factor</name>  
            <value>0.7</value>  
            <description>Default queue target capacity.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.user-limit-factor</name>  
            <value>0.4</value>  
            <description>Default queue user limit a percentage from 0.0 to 1.0.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.maximum-capacity</name>  
            <value>100</value>  
            <description>The maximum capacity of the default queue.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.state</name>  
            <value>RUNNING</value>  
            <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.state</name>  
            <value>RUNNING</value>  
            <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.acl_submit_applications</name>  
            <value>*</value>  
            <description>The ACL of who can submit jobs to the default queue.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.acl_submit_applications</name>  
            <value>*</value>  
            <description>The ACL of who can submit jobs to the default queue.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q1.acl_administer_queue</name>  
            <value>*</value>  
            <description>The ACL of who can administer jobs on the default queue.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.root.q2.acl_administer_queue</name>  
            <value>*</value>  
            <description>The ACL of who can administer jobs on the default queue.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.node-locality-delay</name>  
            <value>40</value>  
            <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which is 40.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.queue-mappings</name>  
            <value/>  
            <description>A list of mappings that will be used to assign jobs to queues The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues, for example, u:%user:%user maps all users to queues with the same name as the user.</description> 
          </property>  
          <property> 
            <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>  
            <value>false</value>  
            <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description> 
          </property> 
        </configuration>

在hadoop帳號下，執行命令說新一下queue配置：
/usr/lib/hadoop-current/bin/yarn rmadmin -refreshQueues

然後打開hadoop ui，就能看到效果：

參數說明

隊列屬性：
yarn.scheduler.capacity.${quere-path}.capacity

例如下麵的配置，表示共三個大的queue，default,q1,q2，q1下麵又分了兩個queue q11,q12。

<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,q1,q2</value>
<description>
  The queues at the this level (root is the root queue).
</description>
</property>

<property>
<name>yarn.scheduler.capacity.root.q1.queues</name>
<value>q11,q12</value>
<description>
  The queues at the this level (root is the root queue).
</description>
</property>

下麵的配置設置每個queue的資源quota配比：

<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>20</value>
<description>Default queue target capacity.</description>
</property>

<property>
<name>yarn.scheduler.capacity.root.q1.capacity</name>
<value>60</value>
<description>Default queue target capacity.</description>
</property>

<property>
<name>yarn.scheduler.capacity.root.q2.capacity</name>
<value>20</value>
<description>Default queue target capacity.</description>
</property>

設置q1大隊列下麵的兩個小隊列的資源quota配比：

<property>
<name>yarn.scheduler.capacity.root.q1.q11.capacity</name>
<value>70</value>
<description>Default queue target capacity.</description>
</property>

<property>
<name>yarn.scheduler.capacity.root.q1.q12.capacity</name>
<value>30</value>
<description>Default queue target capacity.</description>
</property>

下麵的配置表示，q1.q11這個queue，能夠使用q1所有資源的最大比例，q1.q11分配的資源原本是占q1總資源的70%（見上麵配置），但是如果q1裏麵沒有其它作業在跑，都是空閑的，那麼q11是可以使用到q1總資源的90%。但是如果q1裏麵沒有這麼多空閑資源，q11
隻能使用到q1總資源的70%：
```
<property>
<name>yarn.scheduler.capacity.root.q1.q11.maximum-capacity</name>
<value>90</value>
<description>Default queue target capacity.</description>
</property>
```
下麵的配置表示q1.q11裏麵的作業的保底資源要有25%，意思是說，q1下麵的總資源至少還要剩餘25%，q11裏麵的作業才能提上來，如果q1下麵的總資源已經小於25%了，那麼往q11裏麵提作業就要等待：
```
<property>
<name>yarn.scheduler.capacity.root.q1.q11.minimum-user-limit-percent</name>
<value>25</value>
<description>Default queue target capacity.</description>
</property>
```

另外一些比較重要的ACL配置：

yarn.scheduler.capacity.root.q1.acl_submit_applications 表示哪些user/group可以往q1裏麵提作業；
yarn.scheduler.capacity.queue-mappings 這個比較強大，可以設定user/group和queue的映射關係，格式為[u|g]:[name]:[queue_name][,next mapping]*

操作示例

本文使用的EMR集群配置：

注意，如果不指定queue，則默認往default提交：

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 /HiBench/Wordcount/Input

如果指定一個不存在的queue，則會報錯

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=notExist /HiBench/Wordcount/Input3

Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1494398749894_0010 to YARN : Application application_1494398749894_0010 submitted by user hadoop to unknown queue: notExist

再往default裏麵提交job，發現作業等待，因為default的20% quota已經用完。

然後往q2裏麵提交一個作業： hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q2 /HiBench/Wordcount/Input3

設置了q2單個作業不能超過q2總隊列資源的40%，可以看到隻用到50%多，盡管還有其它map在等待，也是不會占用更大資源。當然，這裏不是非常精確，跟集群總體配置和每個map占用的資源有關，一個map占用的資源可以看成最小單元，一個map占用的資源不一定正好到達設定的比例，有可能會超過一點。

往q1.q12裏麵提交job hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q12 /HiBench/Wordcount/Input4

可以看到，q12也設置了yarn.scheduler.capacity.root.q1.q12.user-limit-factor為0.7，可以看到隻占用了q12的78%。

往q11裏麵提交一個作業，會發現它直接占滿了q11，是因為q11設置了yarn.scheduler.capacity.root.q1.q11.maximum-capacity為90%，即q11能占到q1總資源的90%。 hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q11 /HiBench/Wordcount/Input5

通過上麵的例子，可以看到yarn的capacity scheduler還是很強大的，可以實現對集群資源的控製和優先級調度。以上例子中使用了hadoop作業，指定queue的參數為-D mapreduce.job.queuename=${queue-name}，如果是spark作業，可以用--queue ${queue-name}指定。以上作業的提交方式可以直接在EMR控製台的作業中設置。

最後更新：2017-05-10 19:32:00

利用yarn capacity scheduler在EMR集群上實現大集群的多租戶的集群資源quota限製與管控

背景

什麼是capacity調度器

capacity調度器特性

capacity調度器的配置

參數說明

另外一些比較重要的ACL配置：

操作示例

上一篇：移動時代，H5響應式建站將成為主流

下一篇：《雲周刊》第121期：機器學習PAI眼中的《人民的名義》

相關內容

熱門內容

最新內容

利用yarn capacity scheduler在EMR集群上實現大集群的多租戶的集群資源quota限製與管控

背景

什麼是capacity調度器

capacity調度器特性

capacity調度器的配置

參數說明

另外一些比較重要的ACL配置：

操作示例

上一篇： 移動時代，H5響應式建站將成為主流

下一篇： 《雲周刊》第121期：機器學習PAI眼中的《人民的名義》

相關內容

熱門內容

最新內容

上一篇：移動時代，H5響應式建站將成為主流

下一篇：《雲周刊》第121期：機器學習PAI眼中的《人民的名義》