Partial results for KDD Cup 2004

Contributors: Ruben H. Zamar, William J. Welch, Guohua Yan, Hui Shen, Ying Lin, Weiliang Qiu, Fei Yuan

Original data sets

  • training data
  • test data

  • Problem: Protein Homology ( Explanation )
  • Response: kdd_act.txt
  • Explanatory (descriptor) variables: kdd_train.txt
  • Folds for cross-validation: kdd_fold.txt
  • Code for generating 2 and 153 folds
  • Reorganize kdd datasets: kdd_organize_data.R
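    The fold-generating code is linked rather than shown on this page. As a rough sketch only, assuming each training case carries a block ID and that folds are built from whole blocks (so that 153 folds would mean one block per fold), fold labels could be produced as below. The function name make_block_folds and the toy block IDs are hypothetical.

```r
# Assign whole blocks to folds so that no block is split across folds.
# block_id: vector of block IDs, one per case (hypothetical input).
# k: number of folds, e.g. 2, or the number of blocks for one block per fold.
make_block_folds <- function(block_id, k, seed = 1) {
  set.seed(seed)
  blocks <- unique(block_id)
  stopifnot(k >= 1, k <= length(blocks))
  # Shuffle the blocks, then deal them out to folds 1..k in turn.
  shuffled <- blocks[sample.int(length(blocks))]
  fold_of_block <- rep_len(seq_len(k), length(shuffled))
  names(fold_of_block) <- as.character(shuffled)
  # Look up the fold of each case through its block ID.
  unname(fold_of_block[as.character(block_id)])
}

# Toy example: 6 blocks of unequal size (made-up IDs).
block_id <- rep(c(7, 12, 30, 41, 96, 244), times = c(3, 2, 4, 1, 2, 3))
fold2 <- make_block_folds(block_id, k = 2)
fold6 <- make_block_folds(block_id, k = 6)  # one block per fold
```

    Dealing out shuffled blocks keeps the number of blocks per fold balanced, though not necessarily the number of cases.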

    Results for the KDD data

    Methods R Code Plots (*.pdf) Numerical results
    May 27 2004 Submitted by Fei: compare the blocks in the training/test data; find out how class 0 and 1 are distributed in the blocks of training set
    May 27 2004 Submitted by Hui: plot "Hit Rate ~ Block ID" using all training data ( HitRate.pdf )
    May 27 2004 Submitted by Yi: plot the kernel densities of the 74 original explanatory variables using all training data
    May 27 2004 Submitted by Fei: plot the kernel densities of the 74 original explanatory variables using blocks 7 and 244 from the training data
    May 27 2004 Submitted by Yi: plot the kernel densities of the first 15 PCs, which were calculated using all training data
    May 27 2004 Submitted by Fei: plot the kernel densities of the first 15 PCs, which were calculated from five blocks randomly sampled from the 153 blocks of the training data
    May 27 2004 Submitted by Guohua: how to use Perf? Answer: perfMeas.pdf
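    For the "Hit Rate ~ Block ID" entry listed above, the per-block hit rate could be computed along the following lines. This is a sketch under the assumption that "hit rate" means the proportion of class 1 cases within a block; the data frame and its column names are made up for illustration.

```r
# Toy data: block ID and 0/1 class label for each case (made-up values).
train <- data.frame(
  block = c(7, 7, 7, 7, 244, 244, 244, 244, 244),
  y     = c(1, 0, 0, 0, 1, 1, 0, 0, 0)
)

# Hit rate per block = mean of the 0/1 labels within the block.
hit_rate <- tapply(train$y, train$block, mean)
hit_df <- data.frame(
  block    = as.numeric(names(hit_rate)),
  hit_rate = as.vector(hit_rate)
)
hit_df
```

    A call such as plot(hit_rate ~ block, data = hit_df) would then draw hit rate against block ID.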
    May 31 2004 Submitted by Fei: look for important variables by applying tree methods to single blocks and to the whole training data (unpruned / pruned trees)
  • Results for Block 7 (Unpruned and pruned tree: 5 variables ): x3, x4, x5, x8, x9
  • Results for Block 244 (Unpruned: 12 variables ): x3, x21, x29, x45, x51, x52, x53, x54, x55, x58, x60, x74
  • Results for Block 244 (Pruned: 9 variables ): x3, x29, x45, x53, x54, x55, x58, x60, x74
  • Results for the whole training dataset (Pruned: 18 variables ): x5, x11, x28, x33, x35, x38, x40, x45, x50, x53, x55, x57, x58, x59, x60, x63, x68, x73
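    The tree fits behind these variable lists are not shown on this page. The sketch below illustrates the general idea, assuming the rpart package (the page does not say which tree implementation was used) and toy data in place of the KDD predictors. The helper used_vars is hypothetical.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
n <- 200
# Toy predictors x1..x5; only x1 determines the class (made-up data).
toy <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(toy) <- paste0("x", 1:5)
toy$y <- factor(as.integer(toy$x1 > 0))

# Variables that appear in at least one split of a fitted tree.
used_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

unpruned <- rpart(y ~ ., data = toy, method = "class")
# Prune at the complexity parameter with the smallest cross-validated error.
best_cp <- unpruned$cptable[which.min(unpruned$cptable[, "xerror"]), "CP"]
pruned <- prune(unpruned, cp = best_cp)

used_vars(unpruned)
used_vars(pruned)
```

    Comparing the two variable sets mirrors the unpruned versus pruned lists reported for single blocks and for the whole training set.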
    June 3 2004 Submitted by Fei:
    Randomly sample 76 blocks from the 153 blocks of the training dataset and store the block numbers of the sample in kddSamplBlocks.mtx
    How to load this file into R?
    Answer:
    source("http://hajek.stat.ubc.ca/~fyuan/rcode/readmtx.R")
    sampleBlocks<-read.mtx("kddSamplBlocks.mtx")
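    For reference, a sample of this kind could be drawn as sketched below. The seed and the placeholder block IDs 1 to 153 are assumptions made for illustration; the real block IDs differ (for example, block 244 appears in the training data), and this is not the code that produced kddSamplBlocks.mtx.

```r
set.seed(2004)                 # arbitrary seed, for reproducibility only
all_blocks <- seq_len(153)     # placeholder IDs; the real block IDs differ
# Draw 76 of the 153 blocks without replacement.
sample_blocks <- sort(all_blocks[sample.int(length(all_blocks), 76)])
# The remaining 77 blocks form the complementary set.
rest_blocks <- setdiff(all_blocks, sample_blocks)
```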
    June 4 2004 Submitted by Guohua:
  • boxplot of each original predictor
  • divide blocks into four groups by hit rate; boxplot of each original predictor within each group. Result: no trend found.
  • scatter plot of block means vs. hit rate for each original predictor
  • my.R
    June 14 2004 Submitted by Fei:
    How to call Perf() in R?

  • Step 1: Download myperf3.o into your own directory.
  • Step 2: Store the estimated probabilities in two files that have the same format as temp0.txt and temp1.txt, or in one file that has the same format as temp.txt.
  • Step 3: Run the following R code:

    Choice 1 (two input files):
    dyn.load("myperf3.o")   # load the compiled perf code
    # command-line style arguments passed on to perf
    MyArgv<-c("perf","-top1","-rms","-rkl","-apr","-blocks","-files","./temp0.txt","./temp1.txt")
    MyArgc<-as.integer(length(MyArgv))   # .C() needs the argument count as an integer
    myout<-rep(0.0,4)   # space for the four returned measures
    storage.mode(myout)<-"double"
    res<-.C("perf", MyArgc,MyArgv,out=myout)$out

    Choice 2 (one input file):
    dyn.load("myperf3.o")   # load the compiled perf code
    # command-line style arguments passed on to perf
    MyArgv<-c("perf","-top1","-rms","-rkl","-apr","-blocks","-file","./temp.txt")
    MyArgc<-as.integer(length(MyArgv))   # .C() needs the argument count as an integer
    myout<-rep(0.0,4)   # space for the four returned measures
    storage.mode(myout)<-"double"
    res<-.C("perf", MyArgc,MyArgv,out=myout)$out

    Sample results:

    > res<-.C("perf", MyArgc,MyArgv,out=myout)$out
    MEAN_BLOCK_APR 0.25000
    MEAN_BLOCK_RKL 2.00000
    MEAN_BLOCK_RMS 0.57614
    MEAN_BLOCK_TOP1 0.50000
    > res
    [1] 0.2500000 2.0000000 0.5761375 0.5000000
    >

    Notes:

    The returned values are stored in the vector "res" in the order: MEAN_BLOCK_APR, MEAN_BLOCK_RKL, MEAN_BLOCK_RMS, MEAN_BLOCK_TOP1
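    Because the order of the returned values is fixed, the result vector can be labelled so that each measure is looked up by name rather than by position. The sketch below reuses the numbers from the sample run; the variable name perf_names is made up for illustration.

```r
# Labels in the order stated in the note: APR, RKL, RMS, TOP1.
perf_names <- c("MEAN_BLOCK_APR", "MEAN_BLOCK_RKL",
                "MEAN_BLOCK_RMS", "MEAN_BLOCK_TOP1")

# Values copied from the sample run (in practice, the output of the .C() call).
res <- c(0.2500000, 2.0000000, 0.5761375, 0.5000000)
names(res) <- perf_names

res["MEAN_BLOCK_RMS"]   # pick out one measure by name
```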
    June 22 2004 Submitted by Fei: Results for LDA with 2-fold cross-validation ( download kdd_lda_whole.txt )
    > res<-.C("perf", MyArgc,MyArgv,out=myout)$out
    MEAN_BLOCK_APR 0.45452
    MEAN_BLOCK_RKL 338.35948
    MEAN_BLOCK_RMS 0.04338
    MEAN_BLOCK_TOP1 0.83660
    June 22 2004 Submitted by Fei: Results for nearest neighbor logistic regression ( download kdd_log_weiliang.txt )
    > res<-.C("perf", MyArgc,MyArgv,out=myout)$out
    MEAN_BLOCK_APR 0.47385
    MEAN_BLOCK_RKL 172.33987
    MEAN_BLOCK_RMS 0.04565
    MEAN_BLOCK_TOP1 0.66667
    June 22 2004 Submitted by Fei: Functions borrowed from Weiliang for calculating nearest neighbor logistic regression
    matrix_position.R
    separation.R
    CallWei.R
    June 24 2004 Submitted by Yi: kel function:
  • Step 1: download mytest.o into your own directory
  • Step 2: run the kel function: kel.txt
    June 30 2004 Submitted by Fei:
    How to assemble the 2-fold estimated probabilities?
    Answer: sample code
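    The linked sample code is not reproduced on this page. One common way to assemble cross-validated probabilities is sketched below: each fold is held out in turn, a model is fitted on the remaining fold, and the predictions are written back into the positions of the held-out cases. The toy data, the fold labels and the use of a plain logistic regression are all assumptions made for illustration.

```r
set.seed(1)
n <- 40
# Toy data: one predictor and a 0/1 response (made-up values).
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x))
fold <- rep(1:2, length.out = n)   # hypothetical fold labels

# One slot per case; each slot is filled exactly once, when its fold is held out.
prob <- rep(NA_real_, n)
for (f in sort(unique(fold))) {
  held_out <- fold == f
  fit <- glm(y ~ x, family = binomial, data = dat[!held_out, ])
  prob[held_out] <- predict(fit, newdata = dat[held_out, ], type = "response")
}
```

    Indexing by the held-out positions keeps the assembled vector in the original case order, which matters when the probabilities are later written out for Perf.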
    July 5 2004 Submitted by Yi:
    Format of the results for each case by all four methods:
    Col1: Block ID and hit case ID
    Col2 to Col6: the estimated probability that each case is in class 1 for Kernel estimate(Ker), Lda, KNN, outlier approach(Maha) and Logistic method(Logis)
    Col7 to Col11: the rank of each case according to these probabilities for Kernel estimate(Ker), Lda, KNN, outlier approach(Maha) and Logistic method(Logis)
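    As an illustration of how a rank column might be derived from a probability column, the sketch below ranks cases within each block. It assumes that rank 1 goes to the case with the highest probability in its block and that ties share the smallest rank; the page does not state either convention, and the data values are made up.

```r
# Toy results for one method: block ID and estimated class 1 probability.
out <- data.frame(
  block = c(7, 7, 7, 244, 244),
  Lda   = c(0.10, 0.80, 0.35, 0.60, 0.20)
)

# Rank within block; negating the probability makes the largest value rank 1.
out$Lda_rank <- ave(-out$Lda, out$block,
                    FUN = function(p) rank(p, ties.method = "min"))
out
```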
    July 5 2004 Submitted by Fei:
    Download new Perf: newperf.o
    July 9 2004 Submitted by Fei:
    Test four subsets of explanatory variables using 153-fold cross-validation: result
    July 9 2004 Submitted by Guohua:
    Guohua's subsets of explanatory variables: subsetOfXs.txt
    July 12 2004 Submitted by Fei:
    some probabilities: kdd_153_log_var.txt
    July 13 2004 Submitted by Fei & Hui:
    Results for the new subset of Xs: kdd-cv-logis.doc

    Website Constructed By Fei Yuan @ The Department of Statistics
    The University of British Columbia
    2004