# Applying KNN To MNIST Dataset

28 Sep 2014The MNIST is a set of handwritten digits images. You can download it by using `Makefile`

:

```
%.gz:
wget http://yann.lecun.com/exdb/mnist/$*.gz
%.idx: %.gz
gzip -d $*.gz
mv $* $*.idx
prepare: t10k-images-idx3-ubyte.idx t10k-labels-idx1-ubyte.idx train-images-idx3-ubyte.idx train-labels-idx1-ubyte.idx
```

The size of each image is 28×28 pixels.

## The IDX File Format

The MNIST dataset are stored in IDX file format. The basic format is:

```
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
```

The first 4 bytes of a IDX file is `magic number`

:

```
typedef struct magic_number_s {
char byte1; // always 0
char byte2; // always 0
char type;
char dimentions;
} magic_number_t;
```

The first 2 bytes of `magic number`

are always 0. The third byte codes the type of the data:

```
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
```

For MNIST dataset, the type is `unsigned byte`

.

The 4-th byte codes the number of dimensions of the vector/matrix. For image, the number of dimension is 3; for label, the number of dimension is 1.

The `handwrite digits`

is formated as:

```
typedef struct images_s{
magic_number_t magic_number;
unsigned int number_of_images;
unsigned int number_of_rows;
unsigned int number_of_columns;
unsigned char images[1]; // 28x28x[number of images]
} images_t;
```

The size of a single image in MNIST dataset is `28*28`

. Get the image date of a given `index`

by:

```
#define get_img(head, index) (head + 28*28*index)
```

The label, represent the results(digits) of `handwrite digits`

, is formated as:

```
typedef struct labels_s{
magic_number_t magic_number;
unsigned int number_of_items;
unsigned char labels[1]; // [number of labels]
} labels_t;
```

## Bigendian

The MNIST dataset is stored in bigendian. If you want to get the right `number of items`

, which is `unsigned int`

, you may need to convert it to littleendian.

```
// The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
int is_bigendian() {
int i = 1;
char *p = (char *)&i;
if (p[0] == 1)
return 0;
else
return 1;
}
unsigned int bit32conversion(unsigned int num) {
return ((num>>24)&0xff) | ((num<<8)&0xff0000) | ((num>>8)&0xff00) | ((num<<24)&0xff000000);
}
if(!is_bigendian()) {
number_of_items = bit32conversion(*number_of_items);
}
```

## The Distance of Images

There are multiple ways to caculate the distance of two images, one simple way is to using Euclidean distance.

```
// caculate the euclidean distance
double distance(unsigned char img1[], unsigned char img2[]) {
int i,j;
double sum = 0, value;
for(i = 0, j= 28*28; i < j; i++) {
value = img1[i] - img2[i];
sum = sum + value*value;
}
return sum;
}
```

## Multi-thread Design

In this design, finding the `k-Nearest Neighbors`

in the train set for all the images in the test set is a time consuming process. We can design multi-thread program, in which each thread compute only a subset of the test set. The working structure for the thread can be designed as:

```
// the k-nearest neighbors is here.
typedef struct knn_s{
int index;
double distance;
} knn_t;
typedef struct cfg_s{
knn_t * knn;
unsigned int number_of_neighbor;
unsigned int k;
unsigned int train_start, train_stop, test_start, test_stop;
unsigned int total;
unsigned int * hit;
unsigned int thread;
} cfg_t;
```

In the multi-thread design, the program can utilize all the cores of the CUPs.

## Performance Evaluate

The performance we evaluate here is the `hit ratio`

, which indicate the percentage the program success in telling the right digit from the image.

k | hit rate (%) |
---|---|

1 | 97.91 |

2 | 96.27 |

3 | 97.05 |

4 | 96.82 |

5 | 96.88 |

6 | 96.77 |

7 | 96.94 |

8 | 96.70 |

9 | 96.59 |

10 | 96.65 |

The best performance is get when `k = 3`

.

The complete source code can be got from: https://github.com/janzhou/cap5610