Applying KNN To MNIST Dataset

28 Sep 2014

The MNIST is a set of handwritten digits images. You can download it by using Makefile:

%.gz:
    wget http://yann.lecun.com/exdb/mnist/$*.gz

%.idx: %.gz
    gzip -d $*.gz
    mv $* $*.idx

prepare: t10k-images-idx3-ubyte.idx t10k-labels-idx1-ubyte.idx train-images-idx3-ubyte.idx train-labels-idx1-ubyte.idx

The size of each image is 28×28 pixels.

The IDX File Format

The MNIST dataset are stored in IDX file format. The basic format is:

magic number 
size in dimension 0 
size in dimension 1 
size in dimension 2 
..... 
size in dimension N 
data

The first 4 bytes of a IDX file is magic number:

typedef struct magic_number_s {
  char byte1;       // always 0
  char byte2;       // always 0
  char type;
  char dimentions;
} magic_number_t;

The first 2 bytes of magic number are always 0. The third byte codes the type of the data:

0x08: unsigned byte 
0x09: signed byte 
0x0B: short (2 bytes) 
0x0C: int (4 bytes) 
0x0D: float (4 bytes) 
0x0E: double (8 bytes)

For MNIST dataset, the type is unsigned byte.

The 4-th byte codes the number of dimensions of the vector/matrix. For image, the number of dimension is 3; for label, the number of dimension is 1.

The handwrite digits is formated as:

typedef struct images_s{
  magic_number_t magic_number;
  unsigned int number_of_images;
  unsigned int number_of_rows;
  unsigned int number_of_columns;
  unsigned char images[1]; // 28x28x[number of images]
} images_t;

The size of a single image in MNIST dataset is 28*28. Get the image date of a given index by:

#define get_img(head, index) (head + 28*28*index)

The label, represent the results(digits) of handwrite digits, is formated as:

typedef struct labels_s{
  magic_number_t magic_number;
  unsigned int number_of_items;
  unsigned char labels[1]; // [number of labels]
} labels_t;

Bigendian

The MNIST dataset is stored in bigendian. If you want to get the right number of items, which is unsigned int, you may need to convert it to littleendian.

// The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
int is_bigendian() {
  int i = 1;
  char *p = (char *)&i;

  if (p[0] == 1)
    return 0;
  else
    return 1;
}

unsigned int bit32conversion(unsigned int num) {
  return ((num>>24)&0xff) | ((num<<8)&0xff0000) | ((num>>8)&0xff00) | ((num<<24)&0xff000000);
}

if(!is_bigendian()) {
  number_of_items    = bit32conversion(*number_of_items);
}

The Distance of Images

There are multiple ways to caculate the distance of two images, one simple way is to using Euclidean distance.

// caculate the euclidean distance
double distance(unsigned char img1[], unsigned char img2[]) {
  int i,j;
  double sum = 0, value;
  for(i = 0, j= 28*28; i < j; i++) {
    value = img1[i] - img2[i];
    sum = sum + value*value;
  }
  return sum;
}

Multi-thread Design

In this design, finding the k-Nearest Neighbors in the train set for all the images in the test set is a time consuming process. We can design multi-thread program, in which each thread compute only a subset of the test set. The working structure for the thread can be designed as:

// the k-nearest neighbors is here.
typedef struct knn_s{
  int index;
  double distance;
} knn_t;

typedef struct cfg_s{
  knn_t * knn;
  unsigned int number_of_neighbor;
  unsigned int k;
  unsigned int train_start, train_stop, test_start, test_stop;
  unsigned int total;
  unsigned int * hit;
  unsigned int thread;
} cfg_t;

In the multi-thread design, the program can utilize all the cores of the CUPs.

Performance Evaluate

The performance we evaluate here is the hit ratio, which indicate the percentage the program success in telling the right digit from the image.

k hit rate (%)
1 97.91
2 96.27
3 97.05
4 96.82
5 96.88
6 96.77
7 96.94
8 96.70
9 96.59
10 96.65

The best performance is get when k = 3.

The complete source code can be got from: https://github.com/janzhou/cap5610