2017-01-08
*post*
================================================================================

Hand Signals Recognition using Convolutional Neural Network

================================================================================

This weekend I managed to release yet another toy project of mine, called hsr.

In this Medium story, I want to share my experience of how I collected the training data and how I implemented and trained the neural network model. I also highlight the TensorFlow Input Pipeline that I use in this project.

Collecting The Training Data

Data collection process

I use the Webcam API in the Chrome browser to collect the data. The idea is to capture each frame via the webcam and draw it onto an HTML canvas. Here is the gist of the HTML and JS script that I use. I just click the start button and it starts capturing by itself.

I found that downloading the captured images one by one is very time-consuming. So, I use this simple script to download all of the captured images at once:

// collect the src URLs of all captured <img> elements
var image_els = document.querySelectorAll("img");
var image_urls = [];
for (var a = 0; a < image_els.length; a++) {
    var href = image_els[a].getAttribute("src");
    if (href) image_urls.push(href);
}

// generate an <a> element per image and force a download
// (here: the first 50 images, saved as "5-1-0", "5-1-1", ...)
for (var i = 0; i < 50; i++) {
    var link = document.createElement("a");
    var filename = "5-1-" + i;
    link.setAttribute("download", filename);
    link.setAttribute("href", image_urls[i]);
    link.click();
}

Neural Network Model

The model is highly inspired by LeNet-5 (LeCun, 1998). You can read the paper for the details.

For the input, I choose a size of 320x240. I use two convolutional layers with filter sizes of 32x32 and 16x16. Each convolutional layer is followed by a subsampling layer that uses a max-pooling operation. For the fully-connected part, I use only one hidden layer with 400 units. For the output layer, I use a softmax with 5 units, where each unit represents one class of hand signals.

This architecture can be easily translated into a computation graph in TensorFlow:

    # Convolutional layer 1
    with tf.name_scope('conv1'):
        W = tf.Variable(
            tf.truncated_normal(
                shape=(
                    CONV1_FILTER_SIZE,
                    CONV1_FILTER_SIZE,
                    NUM_CHANNELS,
                    CONV1_FILTER_COUNT),
                dtype=tf.float32,
                stddev=5e-2),
            name='weights')
        b = tf.Variable(
            tf.zeros(
                shape=(CONV1_FILTER_COUNT),
                dtype=tf.float32),
            name='biases')
        conv = tf.nn.conv2d(
            input=images,
            filter=W,
            strides=(1, 1, 1, 1),
            padding='SAME',
            name='convolutional')
        conv_bias = tf.nn.bias_add(conv, b)
        conv_act = tf.nn.relu(
            features=conv_bias,
            name='activation')
        pool1 = tf.nn.max_pool(
            value=conv_act,
            ksize=(1, 2, 2, 1),
            strides=(1, 2, 2, 1),
            padding='SAME',
            name='subsampling')

    # Convolutional layer 2
    with tf.name_scope('conv2'):
        W = tf.Variable(
            tf.truncated_normal(
                shape=(
                    CONV2_FILTER_SIZE,
                    CONV2_FILTER_SIZE,
                    CONV1_FILTER_COUNT,
                    CONV2_FILTER_COUNT),
                dtype=tf.float32,
                stddev=5e-2),
            name='weights')
        b = tf.Variable(
            tf.zeros(
                shape=(CONV2_FILTER_COUNT),
                dtype=tf.float32),
            name='biases')
        conv = tf.nn.conv2d(
            input=pool1,
            filter=W,
            strides=(1, 1, 1, 1),
            padding='SAME',
            name='convolutional')
        conv_bias = tf.nn.bias_add(conv, b)
        conv_act = tf.nn.relu(
            features=conv_bias,
            name='activation')
        pool2 = tf.nn.max_pool(
            value=conv_act,
            ksize=(1, 2, 2, 1),
            strides=(1, 2, 2, 1),
            padding='SAME',
            name='subsampling')

    # Hidden layer
    with tf.name_scope('hidden'):
        conv_output_size = 28800
        W = tf.Variable(
            tf.truncated_normal(
                shape=(conv_output_size, HIDDEN_LAYER_SIZE),
                dtype=tf.float32,
                stddev=5e-2),
            name='weights')
        b = tf.Variable(
            tf.zeros(
                shape=(HIDDEN_LAYER_SIZE),
                dtype=tf.float32),
            name='biases')
        reshape = tf.reshape(
            tensor=pool2,
            shape=[BATCH_SIZE, -1])
        h1 = tf.nn.relu(
            features=tf.add(tf.matmul(reshape, W), b),
            name='activation')

    # Softmax layer
    with tf.name_scope('softmax'):
        W = tf.Variable(
            tf.truncated_normal(
                shape=(HIDDEN_LAYER_SIZE, NUM_CLASS),
                dtype=tf.float32,
                stddev=5e-2),
            name='weights')
        b = tf.Variable(
            tf.zeros(
                shape=(NUM_CLASS),
                dtype=tf.float32),
            name='biases')
        logits = tf.add(tf.matmul(h1, W), b, name='logits')

You can see the full version of the graph.
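
The code above refers to a few hyperparameter constants that are defined elsewhere in the project. As a rough guide, here is one set of values that is consistent with the architecture described earlier; the values marked as placeholders are my assumptions, not necessarily the ones actually used in hsr. It also shows where the magic number 28800 for conv_output_size comes from.

# Hypothetical hyperparameter values, consistent with the description above.
# Values not stated in the post (marked "placeholder") are assumptions.
IMAGE_HEIGHT = 240          # input size from the prose: 320x240
IMAGE_WIDTH = 320
NUM_CHANNELS = 3            # RGB frames from the webcam
NUM_CLASS = 5               # five hand-signal classes

CONV1_FILTER_SIZE = 32      # first convolutional layer: 32x32 filters
CONV2_FILTER_SIZE = 16      # second convolutional layer: 16x16 filters
CONV1_FILTER_COUNT = 4      # placeholder: not stated in the post
CONV2_FILTER_COUNT = 6      # implied by conv_output_size = 28800 below

HIDDEN_LAYER_SIZE = 400     # one fully-connected hidden layer
BATCH_SIZE = 32             # placeholder: not stated in the post

# After two 2x2 max-pooling layers the feature map is (240/4) x (320/4) = 60x80,
# so the flattened convolutional output is 60 * 80 * CONV2_FILTER_COUNT = 28800.
conv_output_size = (IMAGE_HEIGHT // 4) * (IMAGE_WIDTH // 4) * CONV2_FILTER_COUNT
assert conv_output_size == 28800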

Training The Network

For the training process, the first step is to read all the images.

def read_images(data_dir):
    pattern = os.path.join(data_dir, '*.png')
    filenames = tf.train.match_filenames_once(pattern, name='list_files')

    queue = tf.train.string_input_producer(
        filenames,
        num_epochs=NUM_EPOCHS,
        shuffle=True,
        name='queue')

    reader = tf.WholeFileReader()
    filename, content = reader.read(queue, name='read_image')
    filename = tf.Print(
        filename,
        data=[filename],
        message='loading: ')
    filename_split = tf.string_split([filename], delimiter='/')
    label_id = tf.string_to_number(tf.substr(filename_split.values[1],
        0, 1), out_type=tf.int32)
    label = tf.one_hot(
        label_id-1,
        5,
        on_value=1.0,
        off_value=0.0,
        dtype=tf.float32)

    img_tensor = tf.image.decode_png(
        content,
        dtype=tf.uint8,
        channels=3,
        name='img_decode')

    # Preprocess the image: perform random transformations
    # Random flip
    img_tensor_flip = tf.image.random_flip_left_right(img_tensor)

    # Random brightness
    img_tensor_bri = tf.image.random_brightness(img_tensor_flip,
        max_delta=0.2)

    # Per-image scaling
    img_tensor_std = tf.image.per_image_standardization(img_tensor_bri)

    min_after_dequeue = 1000
    capacity = min_after_dequeue + 3 * BATCH_SIZE
    example_batch, label_batch = tf.train.shuffle_batch(
        [img_tensor_std, label],
        batch_size=BATCH_SIZE,
        shapes=[(IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS), (NUM_CLASS)],
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        name='train_shuffle')

    return example_batch, label_batch

read_images takes a path to the data directory as an argument. It also takes care of the preprocessing steps, such as randomly flipping the images and shuffling the batches. The random transformations are there to help prevent the network from overfitting.

This function is an implementation of an Input Pipeline in TensorFlow. The idea is: you create a producer, in this case string_input_producer, then you create a reader that consumes each produced string. Finally, you pass the results to another queue that handles the batching process, such as tf.train.shuffle_batch. Each time you run this graph, you get a batch of data.
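
For completeness, this is roughly how such a queue-based pipeline is driven in TensorFlow 1.x. The session and queue-runner boilerplate below is a sketch, not code taken from hsr, and DATA_DIR is just a placeholder name:

import tensorflow as tf

# Build the input pipeline defined above; DATA_DIR is a placeholder.
example_batch, label_batch = read_images(DATA_DIR)

with tf.Session() as sess:
    # match_filenames_once and num_epochs use local variables internally.
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # Start the threads that fill string_input_producer and shuffle_batch.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    try:
        while not coord.should_stop():
            # In the real training loop you would run the training op here;
            # this just pulls one batch of images and labels from the queue.
            images, labels = sess.run([example_batch, label_batch])
    except tf.errors.OutOfRangeError:
        # Raised once the producer has gone through NUM_EPOCHS epochs.
        pass
    finally:
        coord.request_stop()
        coord.join(threads)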

The next step is to define a loss function and the optimizer. I use cross entropy for the loss and Adam for the optimizer.
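
As a minimal sketch (not the exact code from hsr), the loss and training op could be wired up like this, using the logits from the softmax layer and the label_batch from read_images; the learning rate value here is just a placeholder:

# Cross-entropy loss on the softmax logits, averaged over the batch,
# and an Adam-based training op. The learning rate is a placeholder.
with tf.name_scope('train'):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
        labels=label_batch,
        logits=logits)
    loss = tf.reduce_mean(cross_entropy, name='loss')
    train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)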

That’s it. You can see the full implementation on my GitHub.

================================================================================

TAGS

*post-tags*

================================================================================

LINKS

*post-links*