Partial results for KDD Cup 2004

Contributors: Ruben H. Zamar, William J. Welch, Guohua Yan, Hui Shen, Ying Lin, Weiliang Qiu, Fei Yuan

Original data sets

Problem	Response	Explanatory (descriptor) variables	Folds for cross validation
Protein Homology (Explanation)	kdd_act.txt	kdd_train.txt	kdd_fold.txt (code for generating 2 and 153 folds)
Reorganize the KDD data sets: kdd_organize_data.R

Results for the KDD data

Methods	R Code	Plots (*.pdf)	Numerical results
May 27 2004 Submitted by Fei: compare the blocks in the training and test data; find out how classes 0 and 1 are distributed across the blocks of the training set
May 27 2004 Submitted by Hui: plot "Hit Rate ~ Block ID" using all training data (HitRate.pdf)
May 27 2004 Submitted by Yi: plot the kernel densities of the 74 original explanatory variables using all training data
May 27 2004 Submitted by Fei: plot the kernel densities of the 74 original explanatory variables using blocks 7 and 244 from the training data
May 27 2004 Submitted by Yi: plot the kernel densities of the first 15 PCs, calculated using all training data
May 27 2004 Submitted by Fei: plot the kernel densities of the first 15 PCs, calculated from five blocks sampled at random from the 153 blocks of the training data
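The PC density entries above can be illustrated with a short R sketch. It runs on simulated data, not the KDD files, and assumes the PCs come from prcomp() on centred and scaled variables and the densities from density() with its default settings; the entries do not say which functions or settings were actually used. The names X, block, picked and dens are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the training data: 10 blocks of 50 cases, 74 predictors.
n_blocks <- 10
block <- rep(seq_len(n_blocks), each = 50)
X <- matrix(rnorm(length(block) * 74), ncol = 74,
            dimnames = list(NULL, paste0("x", 1:74)))

# Randomly pick five blocks and keep only their rows.
picked <- sample(unique(block), 5)
X_sub <- X[block %in% picked, , drop = FALSE]

# Principal components of the sampled rows (centred and scaled).
pca <- prcomp(X_sub, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:15]

# One kernel density estimate per PC (default Gaussian kernel and bandwidth).
dens <- lapply(seq_len(ncol(pcs)), function(j) density(pcs[, j]))
```

Calling plot(dens[[1]]) draws the density of the first PC; looping over dens gives the remaining 14.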
May 27 2004 Submitted by Guohua: How to use Perf? Answer: perfMeas.pdf
May 31 2004 Submitted by Fei: try to find important variables by applying tree methods to single blocks and to the whole training data (unpruned / pruned trees)
Results for block 7 (unpruned and pruned tree: 5 variables): x3, x4, x5, x8, x9
Results for block 244 (unpruned: 12 variables): x3, x21, x29, x45, x51, x52, x53, x54, x55, x58, x60, x74
Results for block 244 (pruned: 9 variables): x3, x29, x45, x53, x54, x55, x58, x60, x74
Results for the whole training data set (pruned: 18 variables): x5, x11, x28, x33, x35, x38, x40, x45, x50, x53, x55, x57, x58, x59, x60, x63, x68, x73
June 3 2004 Submitted by Fei: randomly sample 76 blocks from the 153 blocks of the training data set and store the block numbers of the sample in kddSamplBlocks.mtx
How to load this file into R? Answer:
    source("http://hajek.stat.ubc.ca/~fyuan/rcode/readmtx.R")
    sampleBlocks <- read.mtx("kddSamplBlocks.mtx")
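The sampling step in the June 3 entry can be sketched in base R as below. This is only an illustration: the block IDs 1:153 are placeholders (the real IDs come from the training data), the seed is arbitrary, and writing the result in the .mtx format read by read.mtx() is not shown because that format is not described here.

```r
set.seed(2004)  # arbitrary seed, only so the draw is repeatable

# Placeholder block IDs; the real IDs would be taken from the training data.
block_ids <- 1:153

# Draw 76 distinct blocks without replacement.
sample_blocks <- sort(sample(block_ids, size = 76, replace = FALSE))

# The remaining blocks form the other half of the split.
other_blocks <- setdiff(block_ids, sample_blocks)
```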
June 4 2004 Submitted by Guohua: plot a boxplot for each original predictor
Divide the blocks into four groups by hit rate and plot a boxplot for each original predictor within each group. Result: no trend found.
Scatter plot of block means vs. hit rate for each original predictor
my.R
my.R
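The grouping described in the June 4 entry can be sketched as follows. The code uses simulated data and makes two assumptions that the entry does not confirm: that the hit rate of a block is its share of class-1 cases, and that the four groups are equal-sized groups of blocks ordered by hit rate. All variable names are hypothetical.

```r
set.seed(3)

# Simulated cases: 40 hypothetical blocks of 25 cases, one predictor x1.
block <- rep(1:40, each = 25)
y  <- rbinom(length(block), 1, 0.1)   # 0/1 class label
x1 <- rnorm(length(block))

# Hit rate per block (assumed: share of class-1 cases) and block mean of x1.
hit_rate   <- tapply(y, block, mean)
block_mean <- tapply(x1, block, mean)

# Four equal-sized groups of blocks ordered by hit rate (ties broken by position).
block_group <- ceiling(4 * rank(hit_rate, ties.method = "first") / length(hit_rate))

# Carry each block's group back to its cases.
case_group <- block_group[match(block, as.integer(names(hit_rate)))]
```

With these objects, boxplot(x1 ~ case_group) gives the per-group boxplots and plot(block_mean, hit_rate) gives the scatter plot of block means against hit rate.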
June 14 2004 Submitted by Fei: How to call Perf() in R?
Step 1: Download myperf3.o into your own directory.
Step 2: Store the estimated probabilities in two files with the same format as temp0.txt and temp1.txt, or in one file with the same format as temp.txt.
Step 3: Run the following R code.
Choice 1 (two files):
    dyn.load("myperf3.o")
    MyArgv <- c("perf", "-top1", "-rms", "-rkl", "-apr", "-blocks", "-files", "./temp0.txt", "./temp1.txt")
    MyArgv <- as.character(MyArgv)
    MyArgc <- length(MyArgv)
    MyArgc <- as.integer(MyArgc)
    myout <- rep(0.0, 4)
    storage.mode(myout) <- "double"
    res <- .C("perf", MyArgc, MyArgv, out = myout)$out
Choice 2 (one file):
    dyn.load("myperf3.o")
    MyArgv <- c("perf", "-top1", "-rms", "-rkl", "-apr", "-blocks", "-file", "./temp.txt")
    MyArgv <- as.character(MyArgv)
    MyArgc <- length(MyArgv)
    MyArgc <- as.integer(MyArgc)
    myout <- rep(0.0, 4)
    storage.mode(myout) <- "double"
    res <- .C("perf", MyArgc, MyArgv, out = myout)$out
Sample results:
    > res <- .C("perf", MyArgc, MyArgv, out = myout)$out
    MEAN_BLOCK_APR 0.25000
    MEAN_BLOCK_RKL 2.00000
    MEAN_BLOCK_RMS 0.57614
    MEAN_BLOCK_TOP1 0.50000
    > res
    [1] 0.2500000 2.0000000 0.5761375 0.5000000
Notes: The returned values are stored in the vector "res" in the order MEAN_BLOCK_APR, MEAN_BLOCK_RKL, MEAN_BLOCK_RMS, MEAN_BLOCK_TOP1.
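Because the returned vector is unnamed, it may help to label it using the order given in the notes above. The helper label_perf() below is hypothetical and not part of the Perf code; the numbers are the sample output from this entry, typed in by hand, so the sketch runs without myperf3.o.

```r
# Metric names in the order stated in the notes of the June 14 entry.
perf_names <- c("MEAN_BLOCK_APR", "MEAN_BLOCK_RKL",
                "MEAN_BLOCK_RMS", "MEAN_BLOCK_TOP1")

# Hypothetical helper: attach names to the numeric vector returned by .C().
label_perf <- function(res) {
  stopifnot(is.numeric(res), length(res) == length(perf_names))
  setNames(res, perf_names)
}

# Sample output from the entry above, entered by hand for illustration.
res <- c(0.2500000, 2.0000000, 0.5761375, 0.5000000)
labelled <- label_perf(res)
```

A single metric can then be read by name, for example labelled["MEAN_BLOCK_TOP1"].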
June 22 2004 Submitted by Fei: Results for 2-fold cross-validation LDA (download kdd_lda_whole.txt)
    > res <- .C("perf", MyArgc, MyArgv, out = myout)$out
    MEAN_BLOCK_APR 0.45452
    MEAN_BLOCK_RKL 338.35948
    MEAN_BLOCK_RMS 0.04338
    MEAN_BLOCK_TOP1 0.83660
June 22 2004 Submitted by Fei: Results for nearest neighbor logistic regression (download kdd_log_weiliang.txt)
    > res <- .C("perf", MyArgc, MyArgv, out = myout)$out
    MEAN_BLOCK_APR 0.47385
    MEAN_BLOCK_RKL 172.33987
    MEAN_BLOCK_RMS 0.04565
    MEAN_BLOCK_TOP1 0.66667
June 22 2004 Submitted by Fei: Functions borrowed from Weiliang for calculating nearest neighbor logistic regression: matrix_position.R, separation.R, CallWei.R
June 24 2004 Submitted by Yi: kel function
Step 1: Download mytest.o into your own directory.
Step 2: Run the kel function: kel.txt
June 30 2004 Submitted by Fei: How to assemble 2-fold estimated probabilities? Answer: sample code
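The linked sample code is not reproduced here. As a rough illustration of the idea, the sketch below fits a model on one fold, predicts the other, and writes each prediction back to its original row, so that the two halves end up in one vector. It assumes simulated data, a plain logistic regression via glm(), and a hypothetical per-case fold label; in the KDD data the folds are defined by block (kdd_fold.txt).

```r
set.seed(5)

# Simulated data: 200 cases with hypothetical fold labels 1 and 2.
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + dat$x))
fold <- rep(1:2, length.out = n)

# Container for the out-of-fold probabilities, in the original row order.
prob <- rep(NA_real_, n)

for (k in 1:2) {
  held <- fold == k
  # Fit on the other fold, then predict the held-out fold.
  fit <- glm(y ~ x, family = binomial, data = dat[!held, ])
  prob[held] <- predict(fit, newdata = dat[held, ], type = "response")
}
```

After the loop, prob holds one estimated probability per case, each coming from the model that did not see that case.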
July 5 2004 Submitted by Yi: Format of the results for each case by all methods:
Col1: Block ID and hit case ID
Col2 to Col6: the probability of each case being in class 1 for the kernel estimate (Ker), LDA (Lda), KNN, the outlier approach (Maha) and the logistic method (Logis)
Col7 to Col11: the rank of each case in class 1 according to the probabilities for the kernel estimate (Ker), LDA (Lda), KNN, the outlier approach (Maha) and the logistic method (Logis)
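Rank columns of this kind can be derived from a probability column with ave(). The sketch below uses toy numbers and assumes that ranks are computed within each block and that rank 1 is the highest probability; the entry above does not state either convention. The data frame out and its column names are hypothetical.

```r
# Toy probabilities for two hypothetical blocks.
out <- data.frame(
  block = c(7, 7, 7, 244, 244, 244, 244),
  prob  = c(0.10, 0.80, 0.30, 0.05, 0.60, 0.20, 0.90)
)

# Rank within each block; rank 1 is the highest probability (an assumption).
out$rank <- ave(out$prob, out$block,
                FUN = function(p) rank(-p, ties.method = "min"))
```

The same call can be repeated for each method's probability column to fill the five rank columns.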
July 5 2004 Submitted by Fei: Download new Perf: newperf.o
July 9 2004 Submitted by Fei: Test four subsets of explanatory variables using 153-fold cross-validation: result
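With 153 blocks, 153-fold cross-validation presumably means leaving out one block at a time. The sketch below shows that scheme on simulated data with 12 blocks and compares two made-up variable subsets. The logistic model, the subsets, and the TOP1-style score (share of blocks whose top-ranked case is class 1) are all assumptions for illustration; they are not the subsets or the Perf measures used for the linked result.

```r
set.seed(7)

# Simulated data: 12 hypothetical blocks of 30 cases, three predictors.
block <- rep(1:12, each = 30)
n <- length(block)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + 1.5 * dat$x1))

# Leave-one-block-out probabilities for a given set of predictor names.
cv_prob <- function(vars) {
  form <- reformulate(vars, response = "y")
  prob <- rep(NA_real_, n)
  for (b in unique(block)) {
    held <- block == b
    fit <- glm(form, family = binomial, data = dat[!held, ])
    prob[held] <- predict(fit, newdata = dat[held, ], type = "response")
  }
  prob
}

# Share of blocks whose top-ranked case is class 1 (a TOP1-style score).
top1 <- function(prob) {
  hit <- sapply(split(seq_len(n), block),
                function(i) dat$y[i][which.max(prob[i])])
  mean(hit)
}

# Two hypothetical candidate subsets of explanatory variables.
subsets <- list(a = "x1", b = c("x2", "x3"))
scores <- sapply(subsets, function(v) top1(cv_prob(v)))
```

The vector scores then holds one cross-validated score per subset, which is the kind of comparison the entry above reports.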
July 9 2004 Submitted by Guohua: Guohua's subsets of explanatory variables: subsetOfXs.txt
July 12 2004 Submitted by Fei: some probabilities: kdd_153_log_var.txt
July 13 2004 Submitted by Fei & Hui: Results for the new subset of Xs: kdd-cv-logis.doc

Website Constructed By Fei Yuan @ The Department of Statistics
The University of British Columbia
2004
