

1997-2017: Twenty Years of KDD Cup

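From August 13 to 17, 2017, the 23rd KDD conference was held in Halifax, Canada. KDD Cup is the annual competition in data mining and knowledge discovery organized by ACM SIGKDD. An important part of the annual KDD conference, it has run for twenty years since 1997 and is currently the most influential competition in the data mining field. Today, let's look back at these twenty years of KDD Cup.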

KDD Cup 1997 Direct marketing for lift curve optimization (predicting the most likely charity donors)

Intro:

This year's challenge is to predict who is most likely to donate to a charity. Contestants were evaluated on the accuracy on the validation data set. Note: the data used in KDD Cup 1997 is exactly the same as KDD Cup 1998.

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-1997/Data

Results:

  • First Place (jointly shared):

    • Charles Elkan (University of California, San Diego) with his software BNB, Boosted Naive Bayesian Classifier
    • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
  • Runner Up:

    • Silicon Graphics, Inc. with their software MineSet

KDD Cup 1998 Direct marketing for profit optimization (generating the best direct-mail list)

Intro:

The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-1998/Tasks

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-1998/Data

Results

  • First Place:
    Urban Science Applications, Inc. with their software GainSmarts

  • First Runner Up:
    SAS Institute, Inc. with their software Enterprise Miner

  • Second Runner Up:
    Quadstone Limited with their software Decisionhouse

KDD Cup 1999 Computer network intrusion detection (network intrusion detection and reporting)

Intro:

The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-1999/Tasks

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data

Result:

  • First Place: Bernhard Pfahringer
    Austrian Research Institute for Artificial Intelligence

  • First Runner Up: Itzhak Levin
    LLSoft, Inc. (using Kernel Miner)

  • Second Runner Up: Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin
    MP13 company, Moscow, Russia

KDD Cup 2000 Online retailer website clickstream analysis (web mining tasks based on clickstream and transaction data)

Intro:

The KDD Cup 2000 domain contains clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000.

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2000/Data

Result:

  • Question 1 of KDD Cup 2000

    • First Place: Amdocs
    • Honorable Mentions: Mui Seng Martin Lee, Chong Jin Ong and S. Sathiya Keerthi of Mechanical and Production Engineering Department, National University of Singapore
  • Question 2 of KDD Cup 2000

    • First Place: Salford Systems, Inc
    • Honorable Mentions: MP13 team of Alexei Vopilov, Ivan Shabalin and Vladimir Mikheyev, and the team of Mukund Deshpande, George Karypis, Department of Computer Science and Engineering, University of Minnesota
  • Question 3 of KDD Cup 2000

    • First Place: Salford Systems, Inc
    • Honorable Mentions: Orit Rafaely, Tel-Aviv University and Amdocs
  • Question 4 of KDD Cup 2000

    • First Place: e-steam
    • Honorable Mentions: SAS, Amdocs, and LLSoft, Ltd

KDD Cup 2001 Molecular bioactivity; plus protein locale prediction (bioinformatics and pharmaceuticals: bioactivity prediction for drug design, and prediction of gene/protein function and localization)

Intro:

Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2001/Tasks

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2001/Data

Result:

  • Task 1 - Thrombin

    • First Place: Jie Cheng, Canadian Imperial Bank of Commerce
    • Honorable Mention: T. Silander, University of Helsinki
  • Task 2 - Function

    • First Place: Mark-A. Krogel, University of Magdeburg
    • Honorable Mentions:

    C. Lambert (Golden Helix)

    J. Sese, H. Hayashi, and S. Morishita (University of Tokyo)

    D. Vogel and R. Srinivasan (A.I. Insight)

    S. Pocinki, R. Wilkinson, and P. Gaffney (Lubrizol Corp.)

  • Task 3 - Localization

    • First Place: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, University of Tokyo
    • Honorable Mentions:

    M. Schonlau (RAND)

    W. DuMouchel, C. Volinsky and C. Cortes (AT&T)

    B. Frasca, Z. Zheng, R. Parekh, and R. Kohavi (Blue Martini)

KDD Cup 2002 BioMed document; plus gene role classification (bioinformatics and text mining in the molecular biology domain)

Intro:

This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2002/Tasks

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2002/Data

Result:

  • Task 1: Information Extraction from Biomedical Articles

    • First Place: ClearForest and Celera

    Yizhar Regev and Michal Finkelstein

    • Honorable Mentions:

    Design Technology Institute Ltd., Department of Mechanical Engineering at the National University of Singapore and Genome Institute of Singapore (Shi Min)

    Data Mining Group, Imperial College and Inforsense Limited (Huma Lodhi and Yong Zhang)

    Verity Inc. and Exelixis, Inc. (Bin Chen)

  • Task 2: Yeast Gene Regulation Prediction

    • First Place: Adam Kowalczyk and Bhavani Raskutti

    Telstra Research Laboratories

    • Honorable Mentions:

    David Vogel and Randy Axelrod
    A.I. Insight Inc. and Sentara Healthcare

    Marcus Denecke, Mark-A. Krogel, Marco Landwehr and Tobias Scheffer
    Magdeburg University

    George Forman
    Hewlett Packard Laboratories

    Amal Perera, Bill Jockheck, Willy Valdivia Granda, Anne Denton, Pratap Kotala and William Perrizo
    North Dakota State University

KDD Cup 2003 Network mining and usage log analysis

Intro:

The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2003/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2003/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2003/Data

Result:

  • I. Citation Prediction Task

    • First Place: J N Manjunatha, Raghavendra Pandey, Sivaramakrishnan R., and M Narasimha Murty (1329)
    • First Runner Up: Claudia Perlich, Foster Provost, and Sofus Macskassy (1360)
    • Second Runner Up: David Vogel (1398)

KDD Cup 2004 Particle physics; plus protein homology prediction (multiple performance measures for supervised classification)

Intro:

This year's competition focuses on data-mining for a variety of performance criteria such as Accuracy, Squared Error, Cross Entropy, and ROC Area. As described on this WWW-site, there are two main tasks based on two datasets from the areas of bioinformatics and quantum physics.
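The four criteria named above can be sketched for a binary task as follows. This is a minimal illustration using common textbook definitions, not the competition's official scoring code; the function names, the 0.5 threshold, and the toy labels and scores are all assumptions made for the example.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def squared_error(y_true, y_prob):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def roc_area(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(accuracy(labels, scores), squared_error(labels, scores))
print(cross_entropy(labels, scores), roc_area(labels, scores))
```

The same predictions can rank differently under each measure, which is why a competition built around several criteria rewards models that are well calibrated as well as well ordered.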

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data

Result:

  • Quantum Physics Problem

    • First Place:

    David S. Vogel, Eric Gottschalk, and Morgan C. Wang

    MEDai / A.I. Insight / University of Central Florida
      • Honorable Mention for ROC Area
      • Honorable Mention for Cross Entropy
      • Honorable Mention for SLQ Score

    • First Runner Up: Arpita Chowdhury, Dinesh Bharule, Don Yan, Lalit Wangikar (Inductis Inc.)
      • Honorable Mention for Accuracy
    • Second Runner Up: Christophe Lambert (Golden Helix Inc.)
  • Protein Homology Problem

    • First Place:

    Bernhard Pfahringer

    University of Waikato, Computer Science Department

    • Tied for 1st Place Overall:

    Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

    Institute of Computing Technology, Chinese Academy of Sciences
      • Honorable Mention for Squared Error
      • Honorable Mention for Average Precision

    • Tied for 1st Place Overall:

    David S. Vogel, Eric Gottschalk, and Morgan C. Wang

    MEDai / A.I. Insight / University of Central Florida
      • Honorable Mention for Top-1 Accuracy

    • Honorable Mention for Rank of Last: Dirk Dach, Holger Flick, Christophe Foussette, Marcel Gaspar, Daniel Hakenjos, Felix Jungermann, Christian Kullmann, Anna Litvina, Lars Michele, Katharina Morik, Martin Scholz, Siehyun Strobel, Marc Twiehaus, Nazif Veliu

    Artificial Intelligence Unit, University of Dortmund, Germany

KDD Cup 2005 Internet user search query categorization

Intro:

This year's competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2005/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2005/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2005/Data

Result:

  • Winners

    • Query Categorization Precision Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

    • Query Categorization Performance Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

    • Query Categorization Creativity Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

  • Runner-ups

    • Query Categorization Precision Award

    Budapest University of Technology team

    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi

    • Query Categorization Performance Award

    MEDai/AI Insight/Humboldt University team

    David S. Vogel, Steve Bridges, Steffen Bickel, Peter Haider, Rolf Schimpfky, Peter Siemen, Tobias Scheffer

    • Query Categorization Creativity Award

    Budapest University of Technology team

    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi

KDD Cup 2006 Pulmonary embolisms detection from image data (medical data mining)

Intro:

This year's KDD Cup challenge problem is drawn from the domain of medical data mining. The tasks are a series of Computer-Aided Detection problems revolving around the clinical problem of identifying pulmonary embolisms from three-dimensional computed tomography data. This challenging domain is characterized by:

  • Multiple instance learning
  • Non-IID examples
  • Nonlinear cost functions
  • Skewed class distributions
  • Noisy class labels
  • Sparse data

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2006/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2006/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2006/Data

Result:

  • Task 1 - PE Identification

    • First Place: Robert Bell, Patrick Haffner, and Chris Volinsky (AT&T Research)
    • First Runner Up: Dmitriy Fradkin (Ask.com)
    • Second Runner Up: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
    • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)
  • Task 2 - Patient Classification

    • First Place: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
    • First Runner Up: Ruiping Wang, Yu Su, Ting Liu, Fei Yang, Liangguo Zhang, Dong Zhang, Shiguang Shan, Weiqiang Wang, Ruixiang Sun, and Wen Gao (Institute of Computing Technology, Chinese Academy of Sciences)
    • Second Runner Up: Cas Zhang, Y. Zhou, Q. Wang, and H. Ge (Joint R&D Lab, Chinese Academy of Sciences)
    • Third Runner Up: Dmitriy Fradkin (Ask.com)
    • Best Student Entry: Zhang Cas (IA, PKU)
  • Task 3 - Negative Predictive Value

    • First Place: William Perrizo and Amal Shehan Perera (DataSURG Group, North Dakota State University)
    • Runner Up: Nimisha Gupta and Tarun Agarwal (Strand Life Sciences Pvt. Ltd.)
    • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)

KDD Cup 2007 Consumer recommendations (predicting movie ratings)

Intro:

This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2007/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2007/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2007/Data

Result:

  • Task 1 - Who Rated What

    • First Place: Miklos Kurucz, Andras A. Benczur, Tamas Kiss, Istvan Nagy, Adrienn Szabo, Balazs Torma (Hungarian Academy of Sciences)
    • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
    • Second Runner Up: Yan Liu, Zhenzhen Kou (IBM Research)
  • Task 2 - How Many Ratings

    • First Place: Saharon Rosset, Claudia Perlich, Yan Liu (IBM Research)
    • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
    • Second Runner Up: James Malaugh (Team Lead), Sachin Gangaputra, Nikhil Rastogi, Rahul Shankar, Sandeep Gupta, Kushagra Gupta, Neha Gupta, Gaurav Lal (Inductis)

KDD Cup 2008 Breast cancer (early detection of breast cancer)

Intro:

The KDD Cup 2008 challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. In a screening population, a small fraction of cancerous patients have more than one malignant lesion. To simplify the problem, we only consider one type of cancer - cancerous masses - and only include cancer patients with at most one cancerous mass per patient. The challenge will consist of two parts, each of which is related to the development of algorithms for Computer Aided Detection (CAD) of early stage breast cancer from X-ray images.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2008/Tasks

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2008/Data

Result:

  • Challenge 1

    • First Place: PMG-IBM-Research

    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.

    Affiliation: IBM Research

    • First Runner Up: Hung-Yi Lo

    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin

    Affiliation: National Taiwan University

    • Second Runner Up: yazhene

    Team Members: Yazhene Krishnaraj and Chandan K. Reddy

    Affiliation: Wayne State University

  • Challenge 2

    • First Place: PMG-IBM-Research

    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.

    Affiliation: IBM Research

    • First Runner Up: TZTeam

    Team Members: Didier Baclin

    • Second Runner Up: Hung-Yi Lo

    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin

    Affiliation: National Taiwan University

KDD Cup 2009 Customer relationship prediction (predicting telecom customer behavior)

Intro:

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way, in a CRM system, to build knowledge on customer is to produce scores. A score (the output of a model) is an evaluation for all instances of a target variable to explain (i.e. churn, appetency or up-selling). Tools which produce scores allow to project, on a given population, quantifiable information. The score is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. An industrial customer analysis platform able to build prediction models with a very large number of input variables has been developed by Orange Labs. This platform implements several processing methods for instances and variables selection, prediction and indexation based on an efficient model combined with variable selection regularization and model averaging method. The main characteristic of this platform is its ability to scale on very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that have most contributed to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data

Result:

  • Fast Track

    • First Place: IBM Research

    Ensemble Selection for the KDD Cup Orange Challenge

    • First Runner Up: ID Analytics, Inc

    KDD Cup Fast Scoring on a Large Database

    • Second Runner Up: Old dogs with new tricks (David Slate, Peter W. Frey)
  • Slow Track

    • First Place: University of Melbourne

    University of Melbourne entry

    • First Runner Up: Financial Engineering Group, Inc. Japan

    Stochastic Gradient Boosting

    • Second Runner Up: National Taiwan University, Computer Science and Information Engineering

    Fast Scoring on a Large Database using regularized maximum entropy model, categorical/numerical balanced AdaBoost and selective Naive Bayes

KDD Cup 2010 Student performance evaluation (predicting student performance on math problems from logs of interaction between students and intelligent tutoring systems)

Intro:

How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

This year’s challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.

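Task description

Predict students' results on math problems from logs of interaction between students and intelligent tutoring systems. The task is both practically important and scientifically interesting. The competition provides 3 development data sets and 2 challenge data sets, each split into a training portion and a test portion. The test portion of each challenge data set is hidden, and participants must develop a learning model that accurately predicts the results in that hidden portion.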

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Data

Result:

  • All Teams

    • First Place: National Taiwan University

    Feature engineering and classifier ensembling for KDD CUP 2010

    • First Runner Up: Zhang and Su

    Gradient Boosting Machines with Singular Value Decomposition

    • Second Runner Up: BigChaos @ KDD

    Collaborative Filtering Applied to Educational Data Mining

  • Student Teams

    • First Place: National Taiwan University

    Feature engineering and classifier ensembling for KDD CUP 2010

    • First Runner Up: Zach A. Pardos

    Using HMMs and bagged decision trees to leverage rich features of user and skill

    • Second Runner Up: SCUT Data Mining

    Split-Score-Predicate

KDD Cup 2011 Predict music ratings and identify favorite songs (predicting music ratings and identifying whether a user rated a song)

Intro:

  • Learn the rhythm, predict the musical scores

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: "We don't like their sound, and guitar music is on the way out" (Decca Recording Co. rejecting the Beatles, 1962).

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items (songs, albums, artists, genres), all tied together within a known taxonomy.

  • Two Tracks

The competition is divided into two tracks:

The first track is aimed at predicting scores that users gave to various items.
The second track requires separation of loved songs from other songs.

Both tracks are open to all research groups in academia and industry.

The KDD Cup 2011 files are currently offline.

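Task description

Track 1 task: Predicting scores that users gave to various items
(music rating prediction)

Given users' historical ratings of items on Yahoo! Music, predict their ratings of other items (songs, albums, and so on). Submissions are scored by the RMSE (root mean squared error) between predicted and actual ratings. The album, artist, and genre of each song are also provided.

Track 2 task: Separation of loved songs from other songs
(identifying whether a user rated a song)

Each user comes with 6 candidate songs: 3 that the user has rated, and 3 that the user has not rated but that are drawn from songs rated highly by users overall. Song attributes (album, artist, genre, and so on) are again provided. Participants submit a binary (0/1) classification, and the final ranking is based on overall accuracy.

The organizers have taken this task offline, so the data set is no longer available for download.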
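The two scoring rules described above, RMSE for Track 1 and overall accuracy (equivalently, error rate) for Track 2, can be sketched as below. This is an illustrative sketch and not the official evaluation script: the helper names are hypothetical, the numbers are made up, and labeling the three highest-scored of six candidates as 1 is just one simple way to produce a 0/1 answer.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def pick_top_three(scores):
    """Label the three highest-scored candidates as 1 and the rest as 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:3])
    return [1 if i in chosen else 0 for i in range(len(scores))]

def error_rate(predicted_labels, true_labels):
    """Fraction of candidates whose 0/1 label is wrong."""
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return wrong / len(true_labels)

# Track 1 style: made-up predicted vs. actual ratings.
print(rmse([80, 50, 90], [90, 50, 70]))

# Track 2 style: made-up model scores for one user's six candidate songs.
guess = pick_top_three([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
print(guess, error_rate(guess, [1, 1, 1, 0, 0, 0]))
```

Because every user has exactly three positives among six candidates, ranking the six and cutting at three guarantees a balanced answer per user.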

KDD Cup 2012 (Track 1) Predict which users (or information sources) one user might follow in Tencent Weibo (personalized recommendation in a social network)

Intro:

Online social networking services have become tremendously popular in recent years, with popular social networking sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendship and sharing interests online. Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits the Tencent Weibo users but it can also flood users with huge volumes of information and hence puts them at risk of information overload. Reducing the risk of information overload is a priority for improving the user experience and it also presents opportunities for novel data mining solutions. Thus, capturing users’ interests and accordingly serving them with potentially interesting items (e.g. news, games, advertisements, products) is a fundamental and crucial feature of social networking websites like Tencent Weibo.

More information on KDD Cup 2012 (Track 1) can be found at Kaggle.com

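Task description

Track 1 task: Predict which users (or information sources) one user might follow in Tencent Weibo
(personalized recommendation in a social network)

Given Tencent Weibo user profiles, SNS social relationships, interaction records within the social network (retweet, comment, at), and the history of item recommendations over the past 30 days, predict the list of recommended items a user is most likely to accept next.

Official competition page
https://www.kaggle.com/c/kddcup2012-track1

Competition data set
https://www.kaggle.com/c/kddcup2012-track1/data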

KDD Cup 2012 (Track 2) Predict the click-through rate of ads given the query and user information (pCTR click-through rate estimation for a search advertising system)

Intro:

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.
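To make concrete why pCTR is needed both to rank ads and to price clicks, here is a toy sketch. It assumes a common textbook setup that the task description does not spell out: ads are ordered by bid times pCTR, and each click is priced by a generalized second-price rule. The field names, bids, and pCTR values are hypothetical.

```python
def rank_ads(ads):
    """Order ads by expected revenue per impression: bid * pCTR."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["pctr"], reverse=True)

def price_per_click(ranked):
    """Second-price rule (assumed): pay just enough to keep the current slot."""
    prices = []
    for ad, runner_up in zip(ranked, ranked[1:]):
        prices.append(runner_up["bid"] * runner_up["pctr"] / ad["pctr"])
    prices.append(0.0)  # the last ad has nothing below it in this toy setup
    return prices

# Made-up ads: a high bid does not win if the predicted click rate is low.
ads = [
    {"id": "a", "bid": 2.0, "pctr": 0.05},
    {"id": "b", "bid": 1.0, "pctr": 0.20},
    {"id": "c", "bid": 4.0, "pctr": 0.01},
]
ranked = rank_ads(ads)
print([ad["id"] for ad in ranked])
print(price_per_click(ranked))
```

Under these assumptions an error in pCTR changes both the order of the ads and what each advertiser pays, which is why prediction accuracy matters so much in this setting.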

More information on KDD Cup 2012 (Track 2) can be found at Kaggle.com

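Task description

Track 2 task: Predict the click-through rate of ads given the query and user information
(pCTR click-through rate estimation for a search advertising system)

Given the queries users issued on Tencent search, the ads displayed (including ad title, description, URL, and so on), each ad's relative position (its rank among multiple ads), the users' click behavior, and attribute information for advertisers and users, predict how users will click on ads in a later time period.

Official competition page
https://www.kaggle.com/c/kddcup2012-track2

Competition data set
https://www.kaggle.com/c/kddcup2012-track2/data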

KDD Cup 2013 (Track 1) Determine whether an author has written a given paper

Intro:

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name. On the other hand, different authors might share a similar or even the same name.

As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

More information on KDD Cup 2013 (Track 1) can be found at Kaggle.com

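Task description

Track 1 task: Author-Paper Identification Challenge

Microsoft Academic Search is an open platform covering more than 50 million publications and over 19 million authors, updated weekly. Author-name ambiguity is one of the main challenges in providing the service: many authors publish under several variations of their name, and different authors may share a similar or identical name. Profiles of authors with ambiguous names therefore tend to contain wrongly attributed papers. This challenge asks participants to identify which papers in an author profile were actually written by that author.

Official competition page
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

Competition data set
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data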

KDD Cup 2013 (Track 2) Identify which authors correspond to the same person

Intro:

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. This KDD Cup task challenges participants to determine which authors in a given data set are duplicates.

More information on KDD Cup 2013 (Track 2) can be found at Kaggle.com

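Task description

Track 2 task: Author Disambiguation Challenge

This challenge asks participants to determine which authors in the data set are the same person.

Official competition page
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation

Competition data set
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation/data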

KDD Cup 2014 Predict funding requests that deserve an A+ (helping a charity website identify exceptionally exciting projects)

Intro:

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, they ship the materials to the school.

The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to the business, at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

Successful predictions may require a broad range of analytical skills, from natural language processing on the need statements to data mining and classical supervised learning on the descriptive factors around each project.

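Task description

KDD Cup 2014 asks participants to help the charity website DonorsChoose.org pick out projects that are exceptionally exciting to the business. Every project meets some need, but only a few rise well above the typical level. By identifying and recommending these projects early, the site can improve funding outcomes and the user experience, and help more students receive the learning materials they need.

Official competition page
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose

Competition data set
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data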

KDD Cup 2015 Predicting dropouts in MOOC (using big data to predict whether MOOC learners will "skip class")

Intro:

Students' high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students' learning activities. Therefore, in KDD Cup 2015, we will predict dropout on XuetangX, one of the largest MOOC platforms in China.

The competition participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user leaves no records for course C in the log during the next 10 days, we define it as dropout from course C. For more details about the log, please refer to the Data Descriptions.
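The dropout definition above can be sketched as a small labeling function. This is a sketch under stated assumptions rather than the organizers' code: it assumes each log record for one user and one course can be reduced to a calendar date, that `cutoff` is the last observed day, and that the 10-day window covers the days strictly after the cutoff. The name `is_dropout` and the dates are hypothetical.

```python
from datetime import date, timedelta

def is_dropout(log_dates, cutoff, horizon_days=10):
    """True if no log record falls within `horizon_days` days after `cutoff`."""
    window_end = cutoff + timedelta(days=horizon_days)
    return not any(cutoff < d <= window_end for d in log_dates)

cutoff = date(2014, 6, 30)  # made-up end of the observation period

# Activity only before the cutoff: labeled as dropout.
print(is_dropout([date(2014, 6, 20), date(2014, 6, 29)], cutoff))

# One record inside the next 10 days: not a dropout.
print(is_dropout([date(2014, 6, 29), date(2014, 7, 5)], cutoff))
```

Note that under this definition a user who returns on day 11 is still labeled a dropout, so the label measures short-term inactivity and not permanent abandonment.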

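Task description

Dropout rates on MOOC platforms are very high, so predicting whether students are about to drop out is useful for maintaining and encouraging their learning activity. KDD Cup 2015 focuses on predicting dropout on XuetangX, one of the largest MOOC platforms in China. Participants predict, from each user's prior activity, how likely the user is to drop out within the next 10 days.

Official competition page
https://www.kddcup2015.com/information.html

Competition data set
https://data-mining.philippe-fournier-viger.com/the-kddcup-2015-dataset-download-link/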

KDD Cup 2016 Whose papers are accepted the most: towards measuring the impact of research institutions

Intro:

Finding influential nodes in a social network for identifying patterns or maximizing information diffusion has been an actively researched area with many practical applications. In addition to the obvious value to the advertising industry, the research community has long sought mechanisms to effectively disseminate new scientific discoveries and technological breakthroughs so as to advance our collective knowledge and elevate our civilization. For students, parents and funding agencies that are planning their academic pursuits or evaluating grant proposals, having an objective picture of the institutions in question is particularly essential. Partly against this backdrop we have witnessed that releasing a yearly Research Institution or University Ranking has become a tradition for many popular newspapers, magazines and academic institutes. Such rankings not only attract attention from governments, universities, students and parents, but also create debates on the scientific correctness behind the rankings. The most criticized aspect of these rankings is: the data used and the methodology employed for the ranking are mostly unknown to the public.

The 2016 KDD Cup will address this very important problem through publicly available datasets, like the Microsoft Academic Graph (MAG), a freely available dataset that includes information on academic publications and citations. This dataset, being a heterogeneous graph, can be used to study the influential nodes of various types including authors, affiliations and venues; we choose to focus on affiliations in this competition. In effect, given a research field, we are challenging the KDD Cup community to jointly develop data mining techniques to identify the best research institutions based on their publications and how they are cited in research articles.

Tasks:https://www.kdd.org/kdd-cup/view/kdd-cup-2016/Tasks

Rules:https://www.kdd.org/kdd-cup/view/kdd-cup-2016/Rules

Data:https://www.kdd.org/kdd-cup/view/kdd-cup-2016/Data

KDD Cup 2017 Highway Tollgates Traffic Flow Prediction: Travel Time & Traffic Volume Prediction

Intro:

Highway tollgates are well known bottlenecks in traffic networks. During rush hours, long queues at tollgates can overwhelm traffic management authorities. Effective preemptive countermeasures are desired to solve this challenge. Such countermeasures include expediting the toll collection process and streamlining future traffic flow. The expedition of toll collection could be simply allocating temporary toll collectors to open more lanes. Future traffic flow could be streamlined by adaptively tweaking traffic signals at upstream intersections. Preemptive countermeasures will only work when the traffic management authorities receive reliable predictions for future traffic flow. For example, if heavy traffic in the next hour is predicted, then traffic regulators could immediately deploy additional toll collectors and/or divert traffic at upstream intersections.

Traffic flow patterns vary due to different stochastic factors, such as weather conditions, holidays, time of the day, etc. The prediction of future traffic flow and ETA (Estimated Time of Arrival) is a known challenge. An unprecedentedly large amount of traffic data from mobile apps such as Waze (in the US) or Amap (in China) can help us take up that challenge. If the contestants in this proposed KDD CUP could design reliable approaches for future traffic flow and ETA prediction, then the traffic management authorities might be able to capitalize on big data & algorithms for less congestion at tollgates.

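Task description

Highway tollgates are well-known bottlenecks in traffic networks. If congestion over the next hour can be predicted in advance, traffic management authorities can act in time to guide and control traffic at upstream intersections. KDD CUP 2017 asks participants to design algorithms that predict traffic flow and vehicle arrival times, using algorithms and data to empower the transportation sector and reduce congestion.

Task 1: To estimate the average travel time from designated intersections to tollgates

Task 2: To predict average tollgate traffic volume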

Last updated: 2017-08-17 17:32:25
