

Network Analysis: User Guide (new), Machine Learning, Alibaba Cloud




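The Network Analysis panel provides analysis algorithms built on the graph data structure. The figure below shows an example analysis workflow built with the platform's network analysis components:

[Figure: example analysis workflow]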

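Every algorithm component in the Network Analysis panel requires run parameters, described below:

  • Number of processes (parameter key workerNum): the number of nodes on which the job runs in parallel. A larger value gives higher parallelism but also increases the framework's communication overhead.
  • Process memory (parameter key workerMem): the maximum memory a single worker may use. By default each worker is allocated 4096 MB; if actual usage exceeds this value, an OutOfMemory exception is thrown.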

k-Core

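Overview

  • The k-core of a graph is the subgraph that remains after repeatedly removing every node whose degree is less than or equal to k. If a node belongs to the k-core but is removed in the (k+1)-core, its coreness is k. Consequently, every node of degree 1 has coreness 0. The largest coreness among all nodes is called the coreness of the graph.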
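The peeling procedure described above can be sketched in a few lines of Python. This is only an illustration of the definition as this page states it (nodes of degree less than or equal to k are removed, whereas many textbooks remove nodes of degree strictly less than k); it is not the PAI implementation, and the function name `k_core_edges` is made up for the example.

```python
from collections import defaultdict

def k_core_edges(edges, k):
    """Repeatedly drop nodes whose degree is <= k; return the remaining edges."""
    adj = defaultdict(set)
    for u, v in edges:          # treat the edge list as an undirected graph
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:              # keep peeling until a full pass removes nothing
        changed = False
        for node in list(adj):
            if len(adj[node]) <= k:
                for nb in adj.pop(node):
                    adj[nb].discard(node)
                changed = True
    # Emit each surviving edge in both directions, like the sample output.
    return sorted((u, v) for u in adj for v in adj[u])

# Same edges as the KCore_func_test_edge sample table.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"),
         ("3", "4"), ("3", "5"), ("3", "6"), ("5", "6")]
core = k_core_edges(edges, 2)
```

With k = 2, nodes 5 and 6 are peeled off first (degree 2), which leaves the four-node clique 1, 2, 3, 4 that appears in the example's result table.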

Parameter settings

k: the core number; required; default 3.

Example

Test data

SQL to create the data:

    -- Edge table: one row per edge, from flow_out_id to flow_in_id.
    drop table if exists KCore_func_test_edge;
    create table KCore_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
    )tmp;

The graph structure of this data is shown below:

[Figure: graph structure of the test data]

Results

With k = 2, the result is:

    +-------+-------+
    | node1 | node2 |
    +-------+-------+
    | 1 | 2 |
    | 1 | 3 |
    | 1 | 4 |
    | 2 | 1 |
    | 2 | 3 |
    | 2 | 4 |
    | 3 | 1 |
    | 3 | 2 |
    | 3 | 4 |
    | 4 | 1 |
    | 4 | 2 |
    | 4 | 3 |
    +-------+-------+

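Example pai command

    pai -name KCore
    -project algo_public
    -DinputEdgeTableName=KCore_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=KCore_func_test_result
    -Dk=2;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the edge table | Required | - |
| toVertexCol | Column holding the end vertex in the edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| k | Core number | Required | 3 |
+---------------+-------------+-------------------+---------+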

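Single-Source Shortest Path

Overview

  • Single-source shortest path follows Dijkstra's algorithm: given a start node, it outputs the shortest path from that node to every other node.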
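A minimal Dijkstra sketch in Python makes the idea concrete. It is written under two assumptions that the page does not state: edges are treated as directed (from the start column to the end column), and the `distance_cnt` column of the sample output is taken to be the number of distinct shortest paths. The function name `sssp` is hypothetical.

```python
import heapq
from collections import defaultdict

def sssp(edges, start):
    """Dijkstra from `start`; returns {node: (distance, shortest_path_count)}."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))        # assumption: edges are directed u -> v
    dist = {start: 0.0}
    paths = {start: 1}
    done = set()
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                 # stale heap entry
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], paths[v] = nd, paths[u]      # strictly better route
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                paths[v] += paths[u]                  # another route of equal length
    return {n: (dist[n], paths[n]) for n in dist}

# Same edges as the SSSP_func_test_edge sample table.
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 1.0), ("b", "e", 2.0),
         ("e", "d", 1.0), ("c", "e", 1.0), ("f", "g", 3.0), ("a", "d", 4.0)]
result = sssp(edges, "a")
```

Nodes f and g cannot be reached from a, so they do not appear in the result, matching the sample output. The sketch counts one path for the start node itself, whereas the sample table reports 0 in that row.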

Parameter settings

Start node ID: the node from which shortest paths are computed; required.

Example

Test data

SQL to create the data:

    -- Weighted edge table: edge_weight is the length of each edge.
    drop table if exists SSSP_func_test_edge;
    create table SSSP_func_test_edge as
    select
      flow_out_id,flow_in_id,edge_weight
    from
    (
      select "a" as flow_out_id,"b" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"c" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"e" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "e" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"e" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "f" as flow_out_id,"g" as flow_in_id,3.0 as edge_weight from dual
      union all
      select "a" as flow_out_id,"d" as flow_in_id,4.0 as edge_weight from dual
    ) tmp
    ;

The graph structure of this data:

[Figure: graph structure of the test data]

Results

The result is:

    +------------+------------+------------+--------------+
    | start_node | dest_node | distance | distance_cnt |
    +------------+------------+------------+--------------+
    | a | b | 1.0 | 1 |
    | a | c | 3.0 | 1 |
    | a | d | 4.0 | 3 |
    | a | a | 0.0 | 0 |
    | a | e | 3.0 | 1 |
    +------------+------------+------------+--------------+

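Example pai command

    pai -name SSSP
    -project algo_public
    -DinputEdgeTableName=SSSP_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=SSSP_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DstartVertex=a;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| startVertex | Start node ID | Required | - |
| hasEdgeWeight | Whether the edges in the input edge table are weighted | Optional | false |
| edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
+---------------+-------------+-------------------+---------+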

PageRank

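Overview

  • PageRank originated in web search ranking: Google uses the link structure of web pages to compute a rank for each page. The basic idea is that a page pointed to by many other pages is likely to be important or of high quality. Besides the number of links to a page, the algorithm also considers the weight of the page itself and how many outgoing links it has to other pages.
  • In a social network of users, edge weights matter in addition to each user's own influence. For example, a Sina Weibo user more easily influences followers with close ties, such as family, classmates, and colleagues, and has less influence on weak-tie followers who are strangers. In such a network, the edge weight corresponds to the strength of the user-to-user relationship.
  • The PageRank formula with link weights is: [formula image not preserved]. Here w(i) is the weight of node i, c(A,i) is the link weight, and d is the damping factor. Once the iteration stabilizes, the node weight W is each user's influence index.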
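Because the formula image is not available here, the following Python sketch uses one plausible form of weighted PageRank: W(A) = (1 - d) / N + d * sum over in-links i of W(i) * c(A,i). Both this form and the damping value d = 0.85 are assumptions; they were chosen because they reproduce the sample output in the example, not because the page confirms them. The function name `weighted_pagerank` is hypothetical.

```python
def weighted_pagerank(edges, d=0.85, max_iter=30, tol=1e-9):
    """Iterate W(A) = (1 - d) / N + d * sum_i W(i) * c(A, i) over in-links i -> A.

    Assumptions: d = 0.85 and raw (unnormalised) link weights c(A, i).
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        new = {node: (1.0 - d) / n for node in nodes}
        for u, v, c in edges:
            new[v] += d * rank[u] * c          # u passes weight along edge u -> v
        delta = max(abs(new[x] - rank[x]) for x in nodes)
        rank = new
        if delta < tol:                        # converged before max_iter
            break
    return rank

# Same edges as the PageRankWithWeight_func_test_edge sample table.
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
         ("b", "d", 1.0), ("c", "d", 1.0)]
ranks = weighted_pagerank(edges)
```

On this small acyclic graph the iteration settles after a handful of passes, well inside the default limit of 30 iterations.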

Parameter settings

Maximum iterations: optional, default 30. The algorithm converges and stops iterating on its own.

Example

Test data

SQL to create the data:

    -- Weighted edge table: weight is the link weight of each edge.
    drop table if exists PageRankWithWeight_func_test_edge;
    create table PageRankWithWeight_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id,'b' as flow_in_id,1.0 as weight from dual
      union all
      select 'a' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
      union all
      select 'c' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
    )tmp
    ;

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is:

    +------+------------+
    | node | weight |
    +------+------------+
    | a | 0.0375 |
    | b | 0.06938 |
    | c | 0.12834 |
    | d | 0.20556 |
    +------+------------+

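Example pai command

    pai -name PageRankWithWeight
    -project algo_public
    -DinputEdgeTableName=PageRankWithWeight_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=PageRankWithWeight_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=weight
    -DmaxIter=100;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Whether the edges in the input edge table are weighted | Optional | false |
| edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
| maxIter | Maximum number of iterations | Optional | 30 |
+---------------+-------------+-------------------+---------+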

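Label Propagation Clustering

Overview

Graph clustering partitions a graph into subgraphs according to its topology, so that nodes within a subgraph are densely linked and links between subgraphs are sparse. The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is that a node's label (community) depends on the labels of its neighbors, with the degree of influence determined by node similarity, and that labels are updated by iterative propagation until they stabilize.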
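The propagation loop can be sketched as follows. This is a simplified, deterministic variant written for illustration only: it uses edge weights as the similarity, ignores node weights, updates nodes one at a time in sorted order, and breaks ties by taking the smallest label. None of these choices are confirmed by the page, and `label_propagation` is a made-up name.

```python
from collections import defaultdict

def label_propagation(edges, max_iter=30):
    """Asynchronous weighted label propagation on an undirected graph."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    label = {n: n for n in adj}              # every node starts in its own group
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            score = defaultdict(float)
            for nb, w in adj[node].items():
                score[label[nb]] += w        # neighbours vote with their edge weight
            # Highest total weight wins; ties go to the smallest label
            # (an assumed, deterministic stand-in for random selection).
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != label[node]:
                label[node] = best
                changed = True
        if not changed:                      # labels are stable
            break
    return label

# Same edges as the LabelPropagationClustering_func_test_edge sample table.
edges = [("1", "2", 0.7), ("1", "3", 0.7), ("1", "4", 0.6), ("2", "3", 0.7),
         ("2", "4", 0.6), ("3", "4", 0.6), ("4", "6", 0.3), ("5", "6", 0.6),
         ("5", "7", 0.7), ("5", "8", 0.7), ("6", "7", 0.6), ("6", "8", 0.6),
         ("7", "8", 0.7)]
labels = label_propagation(edges)
```

The sketch splits the sample graph into the same two groups as the example output, nodes 1 to 4 and nodes 5 to 8. The group identifiers themselves depend on the tie-breaking rule, so they need not equal the ids in the sample table.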

Parameter settings

Maximum iterations: optional, default 30.

Example

Test data

SQL to create the data:

    -- Edge table with edge weights.
    drop table if exists LabelPropagationClustering_func_test_edge;
    create table LabelPropagationClustering_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
    )tmp
    ;
    -- Vertex table with node weights.
    drop table if exists LabelPropagationClustering_func_test_node;
    create table LabelPropagationClustering_func_test_node as
    select * from
    (
      select '1' as node,0.7 as node_weight from dual
      union all
      select '2' as node,0.7 as node_weight from dual
      union all
      select '3' as node,0.7 as node_weight from dual
      union all
      select '4' as node,0.5 as node_weight from dual
      union all
      select '5' as node,0.7 as node_weight from dual
      union all
      select '6' as node,0.5 as node_weight from dual
      union all
      select '7' as node,0.7 as node_weight from dual
      union all
      select '8' as node,0.7 as node_weight from dual
    )tmp
    ;

The graph structure of this data:

[Figure: graph structure of the test data]

Results

The result is:

    +------+------------+
    | node | group_id |
    +------+------------+
    | 1 | 1 |
    | 2 | 1 |
    | 3 | 1 |
    | 4 | 1 |
    | 5 | 5 |
    | 6 | 5 |
    | 7 | 5 |
    | 8 | 5 |
    +------+------------+

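Example pai command

    pai -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=LabelPropagationClustering_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| inputVertexTableName | Name of the input vertex table | Required | - |
| inputVertexTablePartitions | Partitions of the input vertex table | Optional | Whole table is read |
| vertexCol | Column holding the vertex in the input vertex table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Whether the edges in the input edge table are weighted | Optional | false |
| edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
| hasVertexWeight | Whether the vertices in the input vertex table are weighted | Optional | false |
| vertexWeightCol | Column holding the vertex weight in the input vertex table | Optional | - |
| randSelect | Whether to pick the maximum label at random | Optional | false |
| maxIter | Maximum number of iterations | Optional | 30 |
+---------------+-------------+-------------------+---------+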

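Label Propagation Classification

Overview

This is a semi-supervised classification algorithm: it uses the labels of labeled nodes to predict the labels of unlabeled nodes.

During execution, each node's label is propagated to its neighbors according to similarity. At every propagation step, each node updates its label from the labels of its neighbors; the more similar a neighbor is to the node, the greater the weight of that neighbor's influence on its labeling, so similar nodes tend toward the same label and labels spread more easily between them. Throughout propagation, the labels of the labeled data are held fixed, so that they act as sources pushing labels toward the unlabeled data.

When the iteration ends, similar nodes have similar probability distributions and can be assigned to the same class, which completes the label propagation process.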
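The clamped propagation described above might look like the sketch below. It is a deliberately reduced model and rests on several assumptions the page does not confirm: labels flow only along the edge direction, each unlabeled node takes the weighted average of its in-neighbors' label distributions, and the damping factor (alpha) and the seed label weights are ignored. This model was picked because it reproduces the sample output in the example; `propagate_labels` is a hypothetical name.

```python
from collections import defaultdict

def propagate_labels(edges, seeds, epsilon=1e-6, max_iter=30):
    """Spread seed labels along directed edges; seed nodes stay clamped."""
    incoming = defaultdict(list)
    nodes = set(seeds)
    for u, v, w in edges:
        incoming[v].append((u, w))           # assumption: labels flow u -> v only
        nodes.update((u, v))
    # dist[node] maps label -> probability; unlabeled nodes start empty.
    dist = {n: ({seeds[n]: 1.0} if n in seeds else {}) for n in nodes}
    for _ in range(max_iter):
        new, delta = {}, 0.0
        for n in nodes:
            if n in seeds:
                new[n] = dist[n]             # labeled nodes never change
                continue
            total = sum(w for _, w in incoming[n])
            acc = defaultdict(float)
            for u, w in incoming[n]:
                for lab, p in dist[u].items():
                    acc[lab] += w * p / total
            new[n] = dict(acc)
            for lab in acc:
                delta = max(delta, abs(acc[lab] - dist[n].get(lab, 0.0)))
        dist = new
        if delta < epsilon:                  # largest change fell below epsilon
            break
    return dist

# Same data as the LabelPropagationClassification sample tables.
edges = [("a", "b", 0.2), ("a", "c", 0.8), ("b", "c", 1.0), ("d", "b", 1.0)]
seeds = {"a": "X", "d": "Y"}
result = propagate_labels(edges, seeds)
```

Node b receives X from a with weight 0.2 and Y from d with weight 1.0, giving 1/6 and 5/6; node c then averages a's label and b's distribution.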

Parameter settings

Damping factor: default 0.8
Convergence coefficient: default 0.000001

Example

Test data

SQL to create the data:

    -- Edge table with edge weights.
    drop table if exists LabelPropagationClassification_func_test_edge;
    create table LabelPropagationClassification_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id, 'b' as flow_in_id, 0.2 as edge_weight from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id, 0.8 as edge_weight from dual
      union all
      select 'b' as flow_out_id, 'c' as flow_in_id, 1.0 as edge_weight from dual
      union all
      select 'd' as flow_out_id, 'b' as flow_in_id, 1.0 as edge_weight from dual
    )tmp
    ;
    -- Vertex table: the labeled nodes and their label weights.
    drop table if exists LabelPropagationClassification_func_test_node;
    create table LabelPropagationClassification_func_test_node as
    select * from
    (
      select 'a' as node,'X' as label, 1.0 as label_weight from dual
      union all
      select 'd' as node,'Y' as label, 1.0 as label_weight from dual
    )tmp
    ;

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is:

    +------+-----+------------+
    | node | tag | weight |
    +------+-----+------------+
    | a | X | 1.0 |
    | b | X | 0.16667 |
    | b | Y | 0.83333 |
    | c | X | 0.53704 |
    | c | Y | 0.46296 |
    | d | Y | 1.0 |
    +------+-----+------------+

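Example pai command

    pai -name LabelPropagationClassification
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClassification_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClassification_func_test_node
    -DvertexCol=node
    -DvertexLabelCol=label
    -DoutputTableName=LabelPropagationClassification_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=label_weight
    -Dalpha=0.8
    -Depsilon=0.000001;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| inputVertexTableName | Name of the input vertex table | Required | - |
| inputVertexTablePartitions | Partitions of the input vertex table | Optional | Whole table is read |
| vertexCol | Column holding the vertex in the input vertex table | Required | - |
| vertexLabelCol | Column holding the vertex label in the input vertex table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Whether the edges in the input edge table are weighted | Optional | false |
| edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
| hasVertexWeight | Whether the vertices in the input vertex table are weighted | Optional | false |
| vertexWeightCol | Column holding the vertex weight in the input vertex table | Optional | - |
| alpha | Damping factor | Optional | 0.8 |
| epsilon | Convergence coefficient | Optional | 0.000001 |
| maxIter | Maximum number of iterations | Optional | 30 |
+---------------+-------------+-------------------+---------+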

Modularity

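Overview

  • Modularity is a metric for evaluating community structure in a network: it measures how tightly knit the communities produced by a partition are. A value above 0.3 usually indicates a fairly clear community structure.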
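As a worked illustration, the sketch below computes the standard Newman modularity, Q = sum over groups of (edges inside the group / m) minus (total degree of the group / 2m) squared, for an unweighted, undirected graph. Applying it to the label propagation clustering graph with groups {1, 2, 3, 4} and {5, 6, 7, 8} is an assumption about how the sample value was obtained (edge weights are ignored); the function name `modularity` is made up for the example.

```python
def modularity(edges, group):
    """Newman modularity Q for an unweighted, undirected graph."""
    m = len(edges)
    inside = {}   # group -> number of edges with both ends in the group
    degree = {}   # group -> sum of node degrees in the group
    for u, v in edges:
        gu, gv = group[u], group[v]
        degree[gu] = degree.get(gu, 0) + 1
        degree[gv] = degree.get(gv, 0) + 1
        if gu == gv:
            inside[gu] = inside.get(gu, 0) + 1
    return sum(inside.get(g, 0) / m - (degree[g] / (2 * m)) ** 2 for g in degree)

# Edges of the label propagation clustering sample graph (weights dropped).
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"),
         ("3", "4"), ("4", "6"), ("5", "6"), ("5", "7"), ("5", "8"),
         ("6", "7"), ("6", "8"), ("7", "8")]
# Assumed grouping: the two clusters found by label propagation clustering.
group = {n: "1" for n in "1234"}
group.update({n: "5" for n in "5678"})
q = modularity(edges, group)
```

Each group holds 6 of the 13 edges and half of the total degree, so Q = 2 * (6/13 - 0.25) = 11/26, roughly 0.4230769, which agrees with the sample result.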

Example

Test data

Omitted (same data as the label propagation clustering algorithm).

Results

The result is:

    +--------------+
    | val |
    +--------------+
    | 0.4230769 |
    +--------------+

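Example pai command

    pai -name Modularity
    -project algo_public
    -DinputEdgeTableName=Modularity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DfromGroupCol=group_out_id
    -DtoVertexCol=flow_in_id
    -DtoGroupCol=group_in_id
    -DoutputTableName=Modularity_func_test_result;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| fromGroupCol | Group of the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| toGroupCol | Group of the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+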

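Maximal Connected Subgraph

Overview

In an undirected graph G, vertices A and B are said to be connected if there is a path from A to B. Suppose G contains several subgraphs such that all vertices within each subgraph are connected to one another, while no vertex in one subgraph is connected to a vertex in another. These subgraphs are called the maximal connected subgraphs of G.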
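One common way to find such subgraphs is a union-find structure, sketched below. Using the largest node id in each component as its group id is an assumption read off the sample output, not something the page states; `connected_components` is a hypothetical name.

```python
def connected_components(edges):
    """Union-find; each node is mapped to the largest node id in its component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Keep the larger id as the root so it doubles as the group id
            # (assumed convention, inferred from the sample output).
            lo, hi = sorted((ru, rv))
            parent[lo] = hi
    return {node: find(node) for node in parent}

# Same edges as the MaximalConnectedComponent_func_test_edge sample table.
edges = [("1", "2"), ("2", "3"), ("3", "4"), ("1", "4"), ("a", "b"), ("b", "c")]
groups = connected_components(edges)
```

The sample graph falls into two components, {1, 2, 3, 4} with group id 4 and {a, b, c} with group id c.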

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table: two separate components, {1,2,3,4} and {a,b,c}.
    drop table if exists MaximalConnectedComponent_func_test_edge;
    create table MaximalConnectedComponent_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select 'a' as flow_out_id,'b' as flow_in_id from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id from dual
    )tmp;
    -- Result table.
    drop table if exists MaximalConnectedComponent_func_test_result;
    create table MaximalConnectedComponent_func_test_result
    (
      node string,
      grp_id string
    );

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is:

    +-------+-------+
    | node | grp_id|
    +-------+-------+
    | 1 | 4 |
    | 2 | 4 |
    | 3 | 4 |
    | 4 | 4 |
    | a | c |
    | b | c |
    | c | c |
    +-------+-------+

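Example pai command

    pai -name MaximalConnectedComponent
    -project algo_public
    -DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=MaximalConnectedComponent_func_test_result;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+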

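Node Clustering Coefficient

Overview

For an undirected graph G, this computes the density around each node. A star network has density 0, and a fully connected network has density 1.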
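The density around a node can be read as the local clustering coefficient: the number of edges among a node's neighbors divided by the number of neighbor pairs. The sketch below follows that reading, which matches the `node_cnt`, `edge_cnt` and `density` columns of the sample output. It leaves out `log_density` (the page does not define it) and the `maxEdgeCnt` sampling step; `node_density` is a made-up name.

```python
from collections import defaultdict
from itertools import combinations

def node_density(edges):
    """Return {node: (neighbor_count, edges_among_neighbors, density)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for node, nbs in adj.items():
        k = len(nbs)
        # Count neighbour pairs that are themselves joined by an edge.
        links = sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
        possible = k * (k - 1) / 2
        result[node] = (k, links, links / possible if possible else 0.0)
    return result

# Same edges as the NodeDensity_func_test_edge sample table.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"),
         ("2", "3"), ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"),
         ("6", "7")]
density = node_density(edges)
```

For node 1, four of the ten possible pairs among its five neighbors are linked, giving 0.4. In a star, no two leaves are linked, so the center scores 0, as the overview says.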

Parameter settings

maxEdgeCnt: if a node's degree exceeds this value, sampling is applied; optional; default 500.

Example

Test data

SQL to create the data:

    -- Edge table for the undirected test graph.
    drop table if exists NodeDensity_func_test_edge;
    create table NodeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '3' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '7' as flow_in_id from dual
      union all
      select '6' as flow_out_id, '7' as flow_in_id from dual
    )tmp;
    -- Result table.
    drop table if exists NodeDensity_func_test_result;
    create table NodeDensity_func_test_result
    (
      node string,
      node_cnt bigint,
      edge_cnt bigint,
      density double,
      log_density double
    );

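The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is (columns: node, node_cnt, edge_cnt, density, log_density):

    1,5,4,0.4,1.45657
    2,2,1,1.0,1.24696
    3,3,2,0.66667,1.35204
    4,3,2,0.66667,1.35204
    5,4,3,0.5,1.41189
    6,3,2,0.66667,1.35204
    7,2,1,1.0,1.24696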

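Example pai command

    pai -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=NodeDensity_func_test_result
    -DmaxEdgeCnt=500;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| maxEdgeCnt | If a node's degree exceeds this value, sampling is applied | Optional | 500 |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+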

Edge Clustering Coefficient

Overview

For an undirected graph G, this computes the density around each edge.
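The page does not define the density of an edge. Judging from the sample output, it appears to be the number of triangles through the edge divided by the smaller of the two endpoint degrees; the sketch below uses that inferred formula, so treat it as an assumption. `edge_density` is a hypothetical name.

```python
from collections import defaultdict

def edge_density(edges):
    """Return {(u, v): (deg_u, deg_v, triangles, density)} per input edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])     # each common neighbour closes a triangle
        du, dv = len(adj[u]), len(adj[v])
        # Assumed formula, inferred from the sample output.
        result[(u, v)] = (du, dv, triangles, triangles / min(du, dv))
    return result

# Same edges as the EdgeDensity_func_test_edge sample table.
edges = [("1", "2"), ("1", "3"), ("1", "5"), ("1", "7"), ("2", "5"),
         ("2", "4"), ("2", "3"), ("3", "5"), ("3", "4"), ("4", "5"),
         ("4", "8"), ("5", "6"), ("5", "7"), ("5", "8"), ("7", "6"),
         ("6", "8")]
density = edge_density(edges)
```

For edge (2, 5), the endpoints have degrees 4 and 7 and share neighbors 1, 3 and 4, so the density is 3 / 4 = 0.75, as in the sample row `2,5,4,7,3,0.75`.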

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table for the undirected test graph.
    drop table if exists EdgeDensity_func_test_edge;
    create table EdgeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '7' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id from dual
    )tmp;
    -- Result table.
    drop table if exists EdgeDensity_func_test_result;
    create table EdgeDensity_func_test_result
    (
      node1 string,
      node2 string,
      node1_edge_cnt bigint,
      node2_edge_cnt bigint,
      triangle_cnt bigint,
      density double
    );

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is (columns: node1, node2, node1_edge_cnt, node2_edge_cnt, triangle_cnt, density):

    1,2,4,4,2,0.5
    2,3,4,4,3,0.75
    2,5,4,7,3,0.75
    3,1,4,4,2,0.5
    3,4,4,4,2,0.5
    4,2,4,4,2,0.5
    4,5,4,7,3,0.75
    5,1,7,4,3,0.75
    5,3,7,4,3,0.75
    5,6,7,3,2,0.66667
    5,8,7,3,2,0.66667
    6,7,3,3,1,0.33333
    7,1,3,4,1,0.33333
    7,5,3,7,2,0.66667
    8,4,3,4,1,0.33333
    8,6,3,3,1,0.33333

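Example pai command

    pai -name EdgeDensity
    -project algo_public
    -DinputEdgeTableName=EdgeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=EdgeDensity_func_test_result;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+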

Triangle Counting

Overview

For an undirected graph G, this outputs all triangles.
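Listing triangles comes down to finding, for every edge, the neighbors its two endpoints share. The sketch below does exactly that and removes duplicates by sorting each triple. It does not implement the `maxEdgeCnt` sampling for high-degree nodes, and `list_triangles` is a made-up name.

```python
from collections import defaultdict

def list_triangles(edges):
    """Return every triangle once, as a sorted tuple of its three nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    triangles = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:            # w is adjacent to both ends of (u, v)
            triangles.add(tuple(sorted((u, v, w))))
    return sorted(triangles)

# Same edges as the TriangleCount_func_test_edge sample table.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"),
         ("2", "3"), ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"),
         ("6", "7")]
triangles = list_triangles(edges)
```

On the sample graph this yields the five triangles listed in the example output.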

Parameter settings

maxEdgeCnt: if a node's degree exceeds this value, sampling is applied; optional; default 500.

Example

Test data

SQL to create the data:

    -- Edge table for the undirected test graph.
    drop table if exists TriangleCount_func_test_edge;
    create table TriangleCount_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id from dual
    )tmp;
    -- Result table.
    drop table if exists TriangleCount_func_test_result;
    create table TriangleCount_func_test_result
    (
      node1 string,
      node2 string,
      node3 string
    );

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is (columns: node1, node2, node3):

    1,2,3
    1,3,4
    1,4,5
    1,5,6
    5,6,7

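Example pai command

    pai -name TriangleCount
    -project algo_public
    -DinputEdgeTableName=TriangleCount_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=TriangleCount_func_test_result;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| maxEdgeCnt | If a node's degree exceeds this value, sampling is applied | Optional | 500 |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+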

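Tree Depth

Overview

For a collection of tree-shaped networks, this outputs the depth of each node and the ID of the tree it belongs to.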
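One way to obtain these values is a breadth-first search from every root, sketched below. It assumes that edges point from parent to child, that a root is any node with no incoming edge, and that the tree ID is the root's id; these readings fit the sample output but are not spelled out on the page. `tree_depth` is a hypothetical name.

```python
from collections import defaultdict, deque

def tree_depth(edges):
    """Return {node: (root, depth)} via BFS from every node without a parent."""
    children = defaultdict(list)
    has_parent = set()
    nodes = []
    for parent, child in edges:              # assumption: edges go parent -> child
        children[parent].append(child)
        has_parent.add(child)
        nodes.extend((parent, child))
    result = {}
    for root in dict.fromkeys(nodes):        # first-seen order, repeats dropped
        if root in has_parent:
            continue
        result[root] = (root, 0)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in children[node]:
                if child not in result:      # first visit gives the shallowest depth
                    result[child] = (root, result[node][1] + 1)
                    queue.append(child)
    return result

# Same edges as the TreeDepth_func_test_edge sample table.
edges = [("0", "1"), ("0", "2"), ("1", "3"), ("1", "4"), ("2", "4"),
         ("2", "5"), ("4", "6"), ("a", "b"), ("a", "c"), ("c", "d"),
         ("c", "e")]
depths = tree_depth(edges)
```

Node 4 has two parents (1 and 2) in the sample data; both sit at depth 1, so node 4 gets depth 2 either way.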

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table: two trees, rooted at 0 and at a.
    drop table if exists TreeDepth_func_test_edge;
    create table TreeDepth_func_test_edge as
    select * from
    (
      select '0' as flow_out_id, '1' as flow_in_id from dual
      union all
      select '0' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '6' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'b' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'd' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'e' as flow_in_id from dual
    )tmp;
    -- Result table.
    drop table if exists TreeDepth_func_test_result;
    create table TreeDepth_func_test_result
    (
      node string,
      root string,
      depth bigint
    );

The corresponding graph structure:

[Figure: graph structure of the test data]

Results

The result is (columns: node, root, depth):

    0,0,0
    1,0,1
    2,0,1
    3,0,2
    4,0,2
    5,0,2
    6,0,3
    a,a,0
    b,a,1
    c,a,1
    d,a,2
    e,a,2

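Example pai command

    pai -name TreeDepth
    -project algo_public
    -DinputEdgeTableName=TreeDepth_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=TreeDepth_func_test_result;

Algorithm parameters

+---------------+-------------+-------------------+---------+
| Parameter key | Description | Required/Optional | Default |
+---------------+-------------+-------------------+---------+
| inputEdgeTableName | Name of the input edge table | Required | - |
| inputEdgeTablePartitions | Partitions of the input edge table | Optional | Whole table is read |
| fromVertexCol | Column holding the start vertex in the input edge table | Required | - |
| toVertexCol | Column holding the end vertex in the input edge table | Required | - |
| outputTableName | Name of the output table | Required | - |
| outputTablePartitions | Partitions of the output table | Optional | - |
| lifecycle | Lifecycle of the output table | Optional | - |
| workerNum | Number of processes | Optional | Not set |
| workerMem | Process memory | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
+---------------+-------------+-------------------+---------+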

Last updated: 2016-11-23 16:04:15
