

Handling Non-Standard Date and Weather Data on the MaxCompute Platform -- A Rank 20 Solution to the Electric Power AI Competition

The main data source for this competition is the enterprise power consumption table Tianchi_power, which contains the (desensitized) daily power consumption of more than 1,000 enterprises in the high-tech zone of Yangzhong City, including an anonymized enterprise ID, the date, and the consumption. The fields are as follows:

tianchi_power

| Column | Type | Meaning | Example |
| --- | --- | --- | --- |
| record_date | string | date | 20150101 |
| user_id | bigint | enterprise id | 1 |
| power_consumption | bigint | power consumption | 1031 |
| ... | ... | ... | ... |

The weather data table is tianchi_weather_data; its contents are shown below:

[Figure: tianchi_weather_data table contents]

The submission table for contestants is tianchi_power_answer:

| Column | Type | Meaning | Example |
| --- | --- | --- | --- |
| predict_date | string | date | 2016/9/1 |
| power_consumption | bigint | predicted power consumption | 1031 |


This is a short-term load forecasting problem. In 2010 the State Grid Corporation of China issued the enterprise standard Q/GDW 552-2010, "Technical Specification for Short-Term and Ultra-Short-Term Load Forecasting of Power Grids", which introduces the relevant terminology, forecasting targets, error formulas, and commonly used forecasting algorithms. Because the forecast in this competition serves a different purpose, the forecasting targets defined in the standard (time granularity and forecast horizon) were not followed exactly, and a custom error metric was used instead of the standard's formulas, but the essence of the problem is unchanged: it is still a short-term load forecasting problem.

In earlier work on ultra-short-term power forecasting for photovoltaic plants, we found that missing values and numerical weather prediction data have the largest impact on forecast accuracy; the State Grid enterprise standard also gives a rough overview of the factors that influence load forecasting:

[Figure: load-forecast influencing factors from the State Grid standard]

Since social events and the like cannot be known in advance, in this competition we focused on the missing-value and weather-data problems, concentrating our work in three places:

1) encode and transform the official weather data to build a complete set of weather features;

2) build a deliberately overfitted model to fill in missing values;

3) build model 1 on the revised data to predict the trend, build model 2 on the raw data to predict the consumption level (rough magnitude), and fuse the two models with a weighted combination.
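A minimal sketch of this weighted model fusion; the weight here is a hypothetical tuning parameter, the weights actually used in the competition are not given in the text:

```java
public class WeightedFusion {
    // Blend the trend model (trained on revised data) with the level model
    // (trained on raw data). w is a hypothetical weight in [0, 1].
    public static double fuse(double trendPred, double levelPred, double w) {
        return w * trendPred + (1.0 - w) * levelPred;
    }

    public static void main(String[] args) {
        // e.g. equal weighting of a trend estimate and a level estimate
        System.out.println(fuse(1000.0, 1200.0, 0.5)); // 1100.0
    }
}
```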

3. Data Preprocessing

Using the regular-expression string function regexp_extract provided by ODPS SQL, we extract the year, month, and day separately and then convert them into a standard date format:

-- produce the daily total power consumption
DROP TABLE IF EXISTS t_netivs_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_consumption AS
SELECT
	*
	,(year*10000+month*100+day) as day_int -- convert to the 20160101-style format
	,(month*100+day) as month_day
	,(year*100+month) as year_month
	,((year-2015)*12+month) as month_index
FROM
(
	SELECT 
		*
		,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',1) as bigint) as year   -- extract the year
		,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',2) as bigint) as month  -- extract the month
		,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',3) as bigint) as day    -- extract the day
	FROM
	(
		SELECT
			record_date
			,sum(power_consumption) as power_consumption
		FROM
			odps_tc_257100_f673506e024.tianchi_power2
		GROUP BY
			record_date
	)t2
)t1
;

With this code, a non-standard date such as 2016/1/1 is easily converted into a bigint value such as 20160101; later on, the to_date(cast(xxx as string),'yyyymmdd') function makes it just as easy to turn these values back into proper dates.
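As a quick sanity check of the regex logic, the same conversion can be reproduced off-platform; this Java sketch is a hypothetical helper, not part of the competition code, mirroring what the regexp_extract calls above do:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateNormalizer {
    // Same idea as regexp_extract(record_date, '(.*)/(.*)/(.*)', n)
    private static final Pattern P = Pattern.compile("(\\d+)/(\\d+)/(\\d+)");

    // "2016/1/1" -> 20160101, i.e. year*10000 + month*100 + day
    public static long toDayInt(String recordDate) {
        Matcher m = P.matcher(recordDate);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected date: " + recordDate);
        }
        return Long.parseLong(m.group(1)) * 10000
             + Long.parseLong(m.group(2)) * 100
             + Long.parseLong(m.group(3));
    }

    public static void main(String[] args) {
        System.out.println(toDayInt("2016/1/1"));   // 20160101
        System.out.println(toDayInt("2015/12/31")); // 20151231
    }
}
```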

3.2 Holiday Features

Since uploading and downloading data is in principle not allowed during the competition, the proper approach is to implement the holiday handling with case when in ODPS SQL. The holiday and date-feature code is given here:

-- produce extended date features
DROP TABLE IF EXISTS 	t_netivs_date_features;
CREATE TABLE IF NOT EXISTS t_netivs_date_features AS
SELECT
	day_int
	,day_index
	,month_index
	,year_index
	,month
	,day
	,(month*100+day) as month_day
	,(year*100+month) as year_month
	,case when (weekday in (6,7) and special_workday == 0) or holiday==1 then 0 else 1 end as workday
	,weekofyear
	,day_to_lastday
	,month_day_num
	,weekday
	,holiday
	,special_workday
	,special_holiday
	,day1_before_special_holiday
	,day2_before_special_holiday
	,day3_before_special_holiday
	,day1_before_holiday
	,day2_before_holiday
	,day3_before_holiday
	,day1_after_special_holiday
	,day2_after_special_holiday
	,day3_after_special_holiday
	,day1_after_holiday
	,day2_after_holiday
	,day3_after_holiday
FROM
(
	SELECT
		day_int
		,dt
		,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'dd')+1 as day_index
		,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'mm')+1 as month_index
		,datepart(dt,'yyyy')-2015+1 as year_index
		,datepart(dt,'yyyy') as year
		,datepart(dt,'mm') as month
		,datepart(dt,'dd') as day
		,datepart(lastday(dt),'dd') as month_day_num
		,weekofyear(dt) as weekofyear
		,datediff(lastday(dt),dt,'dd') as day_to_lastday
		,weekday(dt) as weekday
		,holiday
		,special_workday
		,special_holiday
		
		,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_special_holiday
		,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_special_holiday
		,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_special_holiday
		
		,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_holiday
		,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_holiday
		,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_holiday
		
		,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_after_special_holiday
		,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_after_special_holiday
		,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_after_special_holiday
		
		,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day1_after_holiday
		,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day2_after_holiday
		,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day3_after_holiday
		
	FROM
	(
		SELECT
			day_int
			,to_date(to_char(day_int),'yyyymmdd') as dt
			,case when day_int in (20150101,20150102,20150103,20150218,20150219,20150220,20150221,20150222,20150223,20150224,20150404,20150405,20150406,20150501,20150502,20150503,20150620,20150621,20150622,20150903,20150904,20150905,20150927,20151001,20151002,20151003,20151004,20151005,20151006,20151007,20160101,20160207,20160208,20160209,20160210,20160211,20160212,20160213,20160404,20160501,20160502,20160609,20160610,20160611,20160915,20160916,20160917,20161001,20161002,20161003,20161004,20161005,20161006,20161007) then 1 else 0 end as holiday
			,case when day_int in (20150104,20150215,20150228,20150906,20151010,20160206,20160214,20160612,20160918,20161008,20161009) then 1 else 0 end as special_workday
			,case when day_int in (20150101,20150218,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as special_holiday
			
		FROM
			t_netivs_tianchi_weather_data	
	)t1
)t2
;

Analyzing the data, it is easy to see that holidays have a very strong effect on daily total consumption, and that the effect carries over: as certain holidays approach, consumption suddenly rises or falls, and after some holidays it keeps rising or falling for several consecutive days. Holidays are therefore handled in some detail here, with features for 1/2/3 days before and 1/2/3 days after each holiday.
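The dayN_before_holiday flags above boil down to simple date arithmetic; here is a minimal Java sketch (a hypothetical helper, using only a small subset of the holiday dates listed in the SQL):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Set;

public class HolidayFlags {
    private static final DateTimeFormatter F = DateTimeFormatter.ofPattern("yyyyMMdd");

    // A small subset of the holiday first days used in the SQL above.
    private static final Set<String> HOLIDAY_STARTS = Set.of("20160101", "20160501", "20161001");

    // dayN_before_holiday = 1 when (date + n days) falls on a holiday start.
    public static int dayBeforeHoliday(String dayInt, int n) {
        LocalDate d = LocalDate.parse(dayInt, F);
        return HOLIDAY_STARTS.contains(d.plusDays(n).format(F)) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(dayBeforeHoliday("20160930", 1)); // National Day is the next day
        System.out.println(dayBeforeHoliday("20160929", 2));
        System.out.println(dayBeforeHoliday("20160927", 1));
    }
}
```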

As the weather data above shows, the weather condition, wind speed, and wind direction are all strings, and they must be converted to numeric values before they can be fed to a machine-learning model. Since the set of possible strings is limited, one option is to sort the strings and use each string's rank as its code, feeding the rank directly into the model. This is simple to do, and the built-in row_number function in ODPS SQL makes it easy to implement. But the drawback of this approach is equally obvious: it makes no use of the relationships between weather types, such as the one between heavy rain (大雨) and heavy-to-torrential rain (大到暴雨). We therefore used OPEN_MR to process the weather data in more detail, along the following lines:

1) collect all the distinct weather strings in the table and inspect their structure and categories;

2) some observations carry only one condition ("大雨, 中雨, 小雨"), while others carry two ("大到暴雨", "多雲轉陰"), so all observations are unified: a single condition is represented as two identical conditions;

3) for each condition, derive a weather type (sunny, snow, rain, ...), a weather level (light/moderate/heavy/torrential rain etc., numbered from 1), and a combined code (weather type plus weather level).
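Step 2 of this normalization (always representing an observation as two condition parts) can be illustrated in isolation; this Java sketch mirrors the splitting logic of the OPEN_MR code that follows:

```java
public class WeatherSplit {
    // Normalize every observation to two condition parts:
    // "多雲轉陰" -> ["多雲", "陰"]; a single condition like "晴" -> ["晴", "晴"].
    public static String[] split(String weather) {
        String w = weather.replace("轉", "~");
        if (w.contains("~")) {
            String[] parts = w.split("~");
            return new String[]{parts[0], parts[1]};
        }
        return new String[]{w, w};
    }

    public static void main(String[] args) {
        String[] a = split("多雲轉陰");
        System.out.println(a[0] + "|" + a[1]);
        String[] b = split("晴");
        System.out.println(b[0] + "|" + b[1]);
    }
}
```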

Weather data processed along these lines can be stored in a table created with the following ODPS SQL statement, which also serves as the output table of the OPEN_MR job:

-- output table for the map-reduce weather processing
-- the December weather data given online has already been processed together, so no further change is needed
-- DROP TABLE IF EXISTS t_netivs_encode_weather;
CREATE TABLE IF NOT EXISTS t_netivs_encode_weather (
	day_int 			bigint
	,temperature_high 	bigint
	,temperature_low 	bigint
	,weather1			bigint
	,weather1_level		bigint
	,weather1_type		bigint
	,weather2			bigint
	,weather2_level		bigint
	,weather2_type		bigint
	,wind_direction		bigint
	,wind_speed			double
	,wind_speed1		double
	,wind_speed2		double
)
;

To parse the weather data, an OPEN_MR job was written; its core code is as follows:

package powerai.weather_preprocess;


import java.util.regex.*;
import java.util.Map;
import java.util.HashMap;

import com.aliyun.odps.data.Record;

public class WeatherPreprocesser {
	// record holding the final output values
	public Record m_output_vals = null;
	//private AntibotUnitily util = new AntibotUnitily();
	public static long 		m_day_int 		= 0;
	public static long 		m_temp_high 	= 0;
	public static long 		m_temp_low 		= 0;

	//public static String 	m_org_weather;
	//public static String 	m_org_wind_direction;
	//public static String 	m_org_wind_speed;
	
	// Output columns: day_int:bigint,temperature_high:bigint,temperature_low:bigint,
	// weather1:bigint,weather1_level:bigint,weather1_type:bigint,weather2:bigint,weather2_level:bigint,weather2_type:bigint,wind_direction:bigint,wind_speed:double,wind_speed1:double,wind_speed2:double
	long	m_weather1			= 0L;
	long 	m_weather1_level 	= 0L;
	long	m_weather1_type 	= 0L;
	long	m_weather2			= 0L;
	long	m_weather2_level 	= 0L;
	long	m_weather2_type 	= 0L;
	long	m_wind_direction 	= 0L;
	double	m_wind_speed		= 0.0;
	double 	m_wind_speed1  		= 0.0;
	double 	m_wind_speed2  		= 0.0;
	
	// Conditions seen in the data: 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、陣雨、雷雨、雷陣雨、小雪、中雪、大雪、雨夾雪、晴、陰、多雲
	// rain: 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、陣雨、雷雨、雷陣雨
	// snow: 小雪、中雪、大雪、雨夾雪
	// sunny: 晴
	// overcast: 陰
	// cloudy: 多雲
		
	public void weather_encode(long day_int, long temperature_high, long temperature_low, String weather, String wind_direction, String wind_speed, Record vals){
				
		m_output_vals 		= vals;
		m_day_int 			= day_int;
		m_temp_high 		= temperature_high;
		m_temp_low 			= temperature_low;
				
		reset();
		
		weather_parser(weather);
		wind_direction_parser(wind_direction);
		wind_speed_parse(wind_speed);
				
		// emit the features
		output();
	}
	
	private void reset(){

		m_weather1			= 0L;
		m_weather1_level 	= 0L;
		m_weather1_type 	= 0L;
		m_weather2			= 0L;
		m_weather2_level 	= 0L;
		m_weather2_type 	= 0L;
		m_wind_direction 	= 0L;	
		m_wind_speed  		= 0.0;
		m_wind_speed1  		= 0.0;
		m_wind_speed2  		= 0.0;
	}
	
	
	// -------------- re-encode the weather condition ---------------------------------------------//
	private void weather_parser(String weather){
			
		String weather1,weather2;
		
		// a trailing "~" means the record is incomplete; just drop the "~"
		if(weather.endsWith("~")){
			weather = weather.substring(0, weather.length()-1);
		}
		
		weather = weather.replace("轉", "~");
		
		// split into the two condition parts
        if(weather.contains("~")){
			
        	weather1 = weather.split("~")[0];
        	weather2 = weather.split("~")[1];
        }
        else {
        	weather1= weather;
        	weather2 = weather;
        }
        
        // parse weather1 and weather2
        // 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、陣雨、雷雨、雷陣雨、小雪、中雪、大雪、雨夾雪、晴、陰、多雲
        m_weather1 = get_weather_index(weather1);
        m_weather1_level = get_weather_level(weather1);
        m_weather1_type = get_weather_type(weather1);
        
        m_weather2 = get_weather_index(weather2);
        m_weather2_level = get_weather_level(weather2);
        m_weather2_type = get_weather_type(weather2);
        
	}
	
	private long get_weather_index(String strWeather) {		
		return get_weather_type(strWeather)*100+get_weather_level(strWeather);
	}
	
	private long get_weather_level(String strWeather) {
		if(strWeather.contains("雪")) {
			// 小雪、中雪、大雪、雨夾雪
			if(strWeather.contains("小雪")) {
				return 1L;
			}
			else if(strWeather.contains("中雪")) {
				return 2L;
			}
			else if(strWeather.contains("大雪")) {
				return 3L;
			}
			else if(strWeather.contains("雨夾雪")) {
				return 4L;
			}
			else {
				return 0L;
			}
		}
		else if(strWeather.contains("雨")) {
			//小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、陣雨、雷雨、雷陣雨
			// longer patterns are checked first so that e.g. 雷陣雨 is not
			// misclassified as 陣雨
			if(strWeather.contains("雷陣雨")) {
				return 10L;
			}
			else if(strWeather.contains("雷雨")) {
				return 9L;
			}
			else if(strWeather.contains("陣雨")) {
				return 8L;
			}
			else if(strWeather.contains("大到暴雨")) {
				return 6L;
			}
			else if(strWeather.contains("暴雨")) {
				return 7L;
			}
			else if(strWeather.contains("中到大雨")) {
				return 4L;
			}
			else if(strWeather.contains("大雨")) {
				return 5L;
			}
			else if(strWeather.contains("小到中雨")) {
				return 2L;
			}
			else if(strWeather.contains("中雨")) {
				return 3L;
			}
			else if(strWeather.contains("小雨")) {
				return 1L;
			}
			else {
				return 0L;
			}
		}
		else {		
			return 0L;
		}
	}
	
	private long get_weather_type(String strWeather) {
		if(strWeather.contains("雪")) {
			return 1L;
		}
		else if(strWeather.contains("雨")) {
			return 2L;
		}
		else if(strWeather.contains("晴")) {
			return 3L;
		}
		else if(strWeather.contains("陰")) {
			return 4L;
		}
		else if(strWeather.contains("多雲")) {
			return 5L;
		}
		else{
			return 0L;
		}
	}

	// -------------- re-encode the wind direction ---------------------------------------------//
	private void wind_direction_parser(String wind_direction){
		m_wind_direction = get_wind_direction(wind_direction);
	}
	
	private long get_wind_direction(String strWindDirection) {
		
		if(strWindDirection == null ||strWindDirection.trim().length() <2) {
			return 0L;
		}
		
		// 東風、東南風、南風、西南風、西風、西北風、北風、東北風
		// compound directions are checked first so that e.g. 西南風 is not
		// misclassified as 南風
		if(strWindDirection.contains("東南風")) {
			return 2L;
		}
		else if(strWindDirection.contains("西南風")) {
			return 4L;
		}
		else if(strWindDirection.contains("西北風")) {
			return 6L;
		}
		else if(strWindDirection.contains("東北風")) {
			return 8L;
		}
		else if(strWindDirection.contains("東風")) {
			return 1L;
		}
		else if(strWindDirection.contains("南風")) {
			return 3L;
		}
		else if(strWindDirection.contains("西風")) {
			return 5L;
		}
		else if(strWindDirection.contains("北風")) {
			return 7L;
		}
		else {
			return 0L;
		}
	}
	
	// -------------- re-encode the wind speed ---------------------------------------------//
	private void wind_speed_parse(String strWindSpeed){
		String strWindSpeed1,strWindSpeed2;
		
		
		strWindSpeed = strWindSpeed.replace("轉", "~");
		
		// split into the two parts
        if(strWindSpeed.contains("~")){
			
        	strWindSpeed1 = strWindSpeed.split("~")[0];
        	strWindSpeed2 = strWindSpeed.split("~")[1];
        }
        else {
        	strWindSpeed1= strWindSpeed;
        	strWindSpeed2 = strWindSpeed;
        }
        
        // parse strWindSpeed1 and strWindSpeed2
        // 微風、1級、2級、小於3級、3級、3-4級、4級、4-5級、5級、5-6級、6級、6-7級、7級
        m_wind_speed1 	= get_wind_speed_code(strWindSpeed1);
        m_wind_speed2 	= get_wind_speed_code(strWindSpeed2);
        m_wind_speed 	= m_wind_speed1*100 + m_wind_speed2;
	}
	
	private double get_wind_speed_code(String strWindSpeed) {
        // 微風、1級、2級、小於3級、3級、3-4級、4級、4-5級、5級、5-6級、6級、6-7級、7級
		if(strWindSpeed.contains("微風")) {
			return 0.5;
		}
		else if(strWindSpeed.contains("1級")) {
			return 1.0;
		}
		else if(strWindSpeed.contains("2級")) {
			return 2.0;
		}
		else if(strWindSpeed.contains("小於3級")) {
			return 2.5;
		}
		else if(strWindSpeed.contains("3級")) {
			return 3.0;
		}
		else if(strWindSpeed.contains("3-4級")) {
			return 3.5;
		}
		else if(strWindSpeed.contains("4級")) {
			return 4.0;
		}
		else if(strWindSpeed.contains("4-5級")) {
			return 4.5;
		}
		else if(strWindSpeed.contains("5級")) {
			return 5.0;
		}
		else if(strWindSpeed.contains("5-6級")) {
			return 5.5;
		}
		else if(strWindSpeed.contains("6級")) {
			return 6.0;
		}
		else if(strWindSpeed.contains("6-7級")) {
			return 6.5;
		}
		else if(strWindSpeed.contains("7級")) {
			return 7.0;
		}
		else {
			return 0.0;
		}
	}

	private void output(){
		int idx = 0;

		m_output_vals.setBigint(idx++, m_weather1);
		m_output_vals.setBigint(idx++, m_weather1_level);
		m_output_vals.setBigint(idx++, m_weather1_type);
		m_output_vals.setBigint(idx++, m_weather2);
		m_output_vals.setBigint(idx++, m_weather2_level);
		m_output_vals.setBigint(idx++, m_weather2_type);
		m_output_vals.setBigint(idx++, m_wind_direction);		
		m_output_vals.setDouble(idx++, m_wind_speed);
		m_output_vals.setDouble(idx++, m_wind_speed1);
		m_output_vals.setDouble(idx++, m_wind_speed2);
		
	}
	
}

-- user 618 is 1 for all 700 days; series like this should not be revised, they will always be missing (consumption recorded as 1)
-- this raises the question: when should consumption values of 1 be revised, and when not?
-- considerations:
-- 1) the user's consumption must be large enough, otherwise it cannot affect the overall prediction;
-- 2) there must be enough values greater than 1, otherwise the revision cannot be accurate.
-- a possible rule, therefore:
-- 1) after filtering out the missing values, compute the mean daily consumption of the remaining days, and only count users whose mean exceeds 100

-- a few user_ids that deserve special attention:
-- 1307 (identical to 1310), 118 days missing
-- 514, 200 days missing
-- 1355, 118 days missing
-- 1133, 547 days missing
-- 1308 (identical to 1309), 568 days missing
-- 741, 672 days missing

-- user_ids to watch, selected from the last 3 months:
-- 1146, missing days in the last 3 months: (22,31,8)
-- 1309, missing days in the last 3 months: (30,15,29)
-- 1147, missing days in the last 3 months: (30,19,0)
-- 1301, missing days in the last 3 months: (30,31,10)
-- 200, missing days in the last 3 months: (30,31,10)
-- 654, missing days in the last 3 months: (5,31,30)
-- 1421, missing days in the last 3 months: (30,31,7)
-- 650, missing days in the last 3 months: (0,0,10)

-- after detailed analysis, the missing-data fill rules adopted are:
-- 1. if all 30 days of November are missing, set the entire history to 1;
-- 2. apart from rule 1, leave users with non_default_power_consumption_median<2500 untouched;
-- 3. leave users whose total number of missing days exceeds 30 untouched;

DROP TABLE IF EXISTS t_netivs_user_missing_info;
CREATE TABLE IF NOT EXISTS t_netivs_user_missing_info AS
select
	case when t11.user_id is not null then t11.user_id else t2.user_id end as user_id
	,case when t11.missing_day_cnt is null then 0 else t11.missing_day_cnt end as missing_day_cnt
	,case when t11.first_default_day_int is null then 0 else t11.first_default_day_int end as first_default_day_int
	,case when t11.last_default_day_int is null then 0 else  t11.last_default_day_int end as last_default_day_int
	,case when t11.last1month_default_day_cnt is null then 0 else  t11.last1month_default_day_cnt end as last1month_default_day_cnt
	,case when t11.last2month_default_day_cnt is null then 0 else  t11.last2month_default_day_cnt end as last2month_default_day_cnt
	,case when t11.last3month_default_day_cnt is null then 0 else  t11.last3month_default_day_cnt end as last3month_default_day_cnt
	,case when t2.power_consumption_avg is null then 0 else  t2.power_consumption_avg end as power_consumption_avg
	,case when t2.power_consumption_median is null then 0 else  t2.power_consumption_median end as power_consumption_median
	,case when t2.power_consumption_max is null then 0 else  t2.power_consumption_max end as power_consumption_max
	,case when t2.power_consumption_min is null then 0 else  t2.power_consumption_min end as power_consumption_min
	,case when t2.first_non_default_day_int is null then 0 else  t2.first_non_default_day_int end as first_non_default_day_int
	,case when t2.last_non_default_day_int is null then 0 else  t2.last_non_default_day_int end as last_non_default_day_int
from
(
	select
		* 
	from
	(
		select 
			user_id
			,count(*) as missing_day_cnt
			,min(day_int) as first_default_day_int
			,max(day_int) as last_default_day_int
			,SUM(case when day_int>=20161101 and day_int<20161201 then 1 else 0 end) as last1month_default_day_cnt
			,SUM(case when day_int>=20161001 and day_int<20161101 then 1 else 0 end) as last2month_default_day_cnt
			,SUM(case when day_int>=20160901 and day_int<20161001 then 1 else 0 end) as last3month_default_day_cnt			
		from 
			t_netivs_ext_power
		where
			power_consumption=1
		group by
			user_id
	)t1
		where missing_day_cnt>1
)t11
FULL OUTER JOIN
(
	select 
		user_id
		,avg(power_consumption) as power_consumption_avg
		,median(power_consumption) as power_consumption_median
		,max(power_consumption) as power_consumption_max
		,min(power_consumption) as power_consumption_min
		,min(day_int) as first_non_default_day_int
		,max(day_int) as last_non_default_day_int
	from 
		t_netivs_ext_power
	where
		power_consumption<>1
	group by
		user_id
)t2
ON t11.user_id = t2.user_id
;

SELECT * FROM t_netivs_user_missing_info where missing_day_cnt>0 ORDER BY power_consumption_median desc limit 500;

-- produce the list of user_id/day_int pairs to fill with xgboost
DROP TABLE IF EXISTS t_netivs_xgb_fill_user_day_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_user_day_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_user_day_list AS
	SELECT
		user_id
		,day_int
	FROM
		t_netivs_ext_power
	WHERE
		power_consumption =1 and user_id in 
		(
			SELECT 
				user_id
			FROM
				t_netivs_user_missing_info
			WHERE
				power_consumption_median>2500 and missing_day_cnt<30 and missing_day_cnt>0
		)		
;


-- produce the user_id list for training the xgboost model
DROP TABLE IF EXISTS t_netivs_xgb_fill_train_user_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_user_list AS
	SELECT
		user_id
		,day_int
	FROM
		t_netivs_ext_power
	WHERE
		power_consumption <> 1 and user_id in 
		(
			SELECT 
				user_id
			FROM
				t_netivs_user_missing_info
			WHERE
				power_consumption_median>2500 and missing_day_cnt<30
		)
;


-- produce the list of user_ids whose entire history is cleared to zero
DROP TABLE IF EXISTS t_netivs_clear_historical_data_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_clear_historical_data_user_list AS
	SELECT 
		user_id
	FROM
		t_netivs_user_missing_info
	WHERE
		last1month_default_day_cnt=30
;


-- produce the GBDT training set

DROP TABLE IF EXISTS t_netivs_gbdt_fill_consumption_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_consumption_features AS
	SELECT
		t1.user_id
		,t1.day_int
		,case when t2.weekly_power_consumption_avg is null then 0 else t2.weekly_power_consumption_avg end as weekly_power_consumption_avg
		,case when t2.weekly_power_consumption_median is null then 0 else t2.weekly_power_consumption_median end as weekly_power_consumption_median
		,case when t3.monthly_power_consumption_avg is null then 0 else t3.monthly_power_consumption_avg end as monthly_power_consumption_avg
		,case when t3.monthly_power_consumption_median is null then 0 else t3.monthly_power_consumption_median end as monthly_power_consumption_median
		,case when t4.last_weekly_power_consumption_avg is null then 0 else t4.last_weekly_power_consumption_avg end as last_weekly_power_consumption_avg
		,case when t4.last_weekly_power_consumption_median is null then 0 else t4.last_weekly_power_consumption_median end as last_weekly_power_consumption_median
		,case when t5.last_monthly_power_consumption_avg is null then 0 else t5.last_monthly_power_consumption_avg end as last_monthly_power_consumption_avg
		,case when t5.last_monthly_power_consumption_median is null then 0 else t5.last_monthly_power_consumption_median end as last_monthly_power_consumption_median
	FROM
		t_netivs_ext_power t1
	LEFT OUTER JOIN
	(
		SELECT
			user_id			
			,weekofyear
			,avg(power_consumption) as weekly_power_consumption_avg
			,median(power_consumption) as weekly_power_consumption_median
		FROM
			t_netivs_ext_power
		WHERE
			power_consumption<>1
		GROUP BY
			user_id,weekofyear
	)t2
	ON t1.user_id = t2.user_id and t1.weekofyear = t2.weekofyear
	LEFT OUTER JOIN
	(
		SELECT
			user_id			
			,year_month
			,avg(power_consumption) as monthly_power_consumption_avg
			,median(power_consumption) as monthly_power_consumption_median
		FROM
			t_netivs_ext_power
		WHERE
			power_consumption<>1
		GROUP BY
			user_id,year_month
	)t3
	ON t1.user_id = t3.user_id and t1.year_month = t3.year_month
	LEFT OUTER JOIN
	(
		SELECT
			user_id			
			,weekofyear
			,avg(power_consumption) as last_weekly_power_consumption_avg
			,median(power_consumption) as last_weekly_power_consumption_median
		FROM
			t_netivs_ext_power
		WHERE
			power_consumption<>1
		GROUP BY
			user_id,weekofyear
	)t4
	ON t1.user_id = t4.user_id and t1.weekofyear = t4.weekofyear+1
	LEFT OUTER JOIN
	(
		SELECT
			user_id			
			,year_month
			,avg(power_consumption) as last_monthly_power_consumption_avg
			,median(power_consumption) as last_monthly_power_consumption_median
		FROM
			t_netivs_ext_power
		WHERE
			power_consumption<>1
		GROUP BY
			user_id,year_month
	)t5
	ON t1.user_id = t5.user_id and t1.year_month = t5.year_month+1
;


DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_features AS
	SELECT
		t1.user_id
		,t1.day_int
		,t2.temperature_high
		,t2.temperature_low
		,t2.weather1
		,t2.weather1_level
		,t2.weather1_type
		,t2.weather2
		,t2.weather2_level
		,t2.weather2_type
		,t2.wind_direction
		,t2.wind_speed
		,t2.wind_speed1
		,t2.wind_speed2
		,t3.day_index
		,t3.month_index
		,t3.year_index
		,t3.month
		,t3.day
		,t3.workday
		,t3.weekday
		,t3.holiday
		,t3.special_workday
		,t3.special_holiday
		,t3.day1_before_special_holiday
		,t3.day2_before_special_holiday
		,t3.day3_before_special_holiday
		,t3.day1_before_holiday
		,t3.day2_before_holiday
		,t3.day3_before_holiday
		,t3.day1_after_special_holiday
		,t3.day2_after_special_holiday
		,t3.day3_after_special_holiday
		,t3.day1_after_holiday
		,t3.day2_after_holiday
		,t3.day3_after_holiday
		,t4.weekly_power_consumption_avg
		,t4.weekly_power_consumption_median
		,t4.monthly_power_consumption_avg
		,t4.monthly_power_consumption_median
		,t4.last_weekly_power_consumption_avg
		,t4.last_weekly_power_consumption_median
		,t4.last_monthly_power_consumption_avg
		,t4.last_monthly_power_consumption_median
		,t5.power_consumption
	FROM
		t_netivs_gbdt_fill_train_user_list t1
	LEFT OUTER JOIN
		t_netivs_encode_weather t2
	ON t1.day_int = t2.day_int
	LEFT OUTER JOIN
		t_netivs_date_features t3
	ON t1.day_int = t3.day_int
	LEFT OUTER JOIN
		t_netivs_gbdt_fill_consumption_features t4
	ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
	LEFT OUTER JOIN
		t_netivs_ext_power t5
	ON t1.user_id = t5.user_id and t1.day_int = t5.day_int
	
;

-- produce the test set for the gbdt fill
DROP TABLE IF EXISTS t_netivs_gbdt_fill_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_test_features AS
	SELECT
		t1.user_id
		,t1.day_int
		,t2.temperature_high
		,t2.temperature_low
		,t2.weather1
		,t2.weather1_level
		,t2.weather1_type
		,t2.weather2
		,t2.weather2_level
		,t2.weather2_type
		,t2.wind_direction
		,t2.wind_speed
		,t2.wind_speed1
		,t2.wind_speed2
		,t3.day_index
		,t3.month_index
		,t3.year_index
		,t3.month
		,t3.day
		,t3.workday
		,t3.weekday
		,t3.holiday
		,t3.special_workday
		,t3.special_holiday
		,t3.day1_before_special_holiday
		,t3.day2_before_special_holiday
		,t3.day3_before_special_holiday
		,t3.day1_before_holiday
		,t3.day2_before_holiday
		,t3.day3_before_holiday
		,t3.day1_after_special_holiday
		,t3.day2_after_special_holiday
		,t3.day3_after_special_holiday
		,t3.day1_after_holiday
		,t3.day2_after_holiday
		,t3.day3_after_holiday
		,t4.weekly_power_consumption_avg
		,t4.weekly_power_consumption_median
		,t4.monthly_power_consumption_avg
		,t4.monthly_power_consumption_median
		,t4.last_weekly_power_consumption_avg
		,t4.last_weekly_power_consumption_median
		,t4.last_monthly_power_consumption_avg
		,t4.last_monthly_power_consumption_median
	FROM
		t_netivs_gbdt_fill_user_day_list t1
	LEFT OUTER JOIN
		t_netivs_encode_weather t2
	ON t1.day_int = t2.day_int
	LEFT OUTER JOIN
		t_netivs_date_features t3
	ON t1.day_int = t3.day_int
	LEFT OUTER JOIN
		t_netivs_gbdt_fill_consumption_features t4
	ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
;

-- overfit a gbdt model and use its prediction as the actual (fill) value
-- DROP OFFLINEMODEL IF EXISTS m_gbdt_fill_model;
-- DROP TABLE IF EXISTS t_netivs_gbdt_fill_model_feature_importance;
-- -- train
-- PAI -name GBDT 
-- -project algo_public 
-- -DfeatureSplitValueMaxSize="500" 
-- -DlossType="3" 
-- -DrandSeed="0" 
-- -DnewtonStep="1" 
-- -Dshrinkage="0.05" 
-- -DmaxLeafCount="32" 
-- -DlabelColName="power_consumption" 
-- -DinputTableName="t_netivs_gbdt_fill_train_features" 
-- -DoutputImportanceTableName="t_netivs_gbdt_fill_model_feature_importance" 
-- -DminLeafSampleCount="500" 
-- -DsampleRatio="0.6" 
-- -DmaxDepth="10" 
-- -DmodelName="m_gbdt_fill_model" 
-- -DmetricType="0" 
-- -DfeatureRatio="0.6" 
-- -Dp="1" 
-- -Dtau="0.6" 
-- -DtestRatio="0" 
-- -DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median" 
-- -DtreeCount="2000"
-- ;


-- predict
-- DROP TABLE IF EXISTS t_netivs_gbdt_fill_prediction_result;
-- PAI
-- -name prediction
-- -project algo_public
-- -DdetailColName="prediction_detail"
-- -DappendColNames="user_id,day_int"
-- -DmodelName="m_gbdt_fill_model"
-- -DitemDelimiter=","
-- -DresultColName="prediction_result"
-- -Dlifecycle="28"
-- -DoutputTableName="t_netivs_gbdt_fill_prediction_result"
-- -DscoreColName="prediction_score"
-- -DkvDelimiter=":"
-- -DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-- -DinputTableName="t_netivs_gbdt_fill_test_features"
-- -DenableSparse="false";

-- SELECT * FROM t_netivs_gbdt_fill_prediction_result ORDER BY prediction_result DESC limit 500;

-- use xgb to produce the fill values
DROP TABLE IF EXISTS t_netivs_xgb_fill_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_xgb_fill_model;

-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="3500"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_gbdt_fill_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.4"
-Dcolsample_bytree="0.6"
-DmodelName="m_xgb_fill_model"
-Dgamma="0"
-Dlambda="50" 
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-Dbase_score="0.11"
-Dmin_child_weight="100"
-DkvDelimiter=":";


-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="user_id,day_int"
-DmodelName="m_xgb_fill_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_xgb_fill_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-DinputTableName="t_netivs_gbdt_fill_test_features"
-DenableSparse="false";

SELECT * FROM t_netivs_xgb_fill_prediction_result ORDER BY prediction_result desc limit 100;

-- produce the revised per-user daily consumption detail
-- user_ids in t_netivs_clear_historical_data_user_list have their history cleared to zero
DROP TABLE IF EXISTS t_netivs_fixed_ext_power;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_ext_power AS
	SELECT
		t1.user_id
		,t1.day_int
		--,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) when t3.user_id is not null then 1 else power_consumption end as bigint) as power_consumption
		,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) else power_consumption end as bigint) as power_consumption
	FROM
		t_netivs_ext_power t1
	LEFT OUTER JOIN
		t_netivs_gbdt_fill_prediction_result t2
	ON t1.user_id = t2.user_id and t1.day_int = t2.day_int
	LEFT OUTER JOIN
		t_netivs_clear_historical_data_user_list t3
	ON t1.user_id = t3.user_id
;

-- produce the revised daily total consumption
DROP TABLE IF EXISTS t_netivs_fixed_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_daily_sum_consumption AS
SELECT
	t1.day_int
	,t1.power_consumption
	,t2.fixed_power_consumption

Last updated: 2017-07-26 09:32:44
