This study presents a method for learning a feature representation and a similarity metric function directly from RGB-D image data in order to compare RGB-D image patches. Comparing image pairs is a fundamental task in many vision problems in robotics applications, such as object tracking, classification, and registration. Among these, we focus on object registration and tracking. Traditional approaches to registering and tracking objects rely on a previously known parametric model of the target object. However, robots must handle a wide variety of objects for which no such model is available. To handle objects in a non-parametric manner, each individual part of an object at the previous time step must be associated with a part at the current time step, so that each associated pair refers to the same part of the object. For this purpose, we propose a feature matching method based on voxel comparison. Each voxel at the previous time step is associated with the voxel at the current time step that is most similar to it in the feature space. Features are extracted by a convolutional neural network (CNN) trained on the Washington RGB-D Object Dataset, modified to include rotational variance. We explore multiple neural network structures to determine which is best suited to this task. The performance of each structure is compared using receiver operating characteristic (ROC) curves.
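The voxel association step described above can be illustrated with a minimal sketch. It assumes that each voxel is already represented by a feature vector and uses squared Euclidean distance as a stand-in for similarity; the study itself learns the similarity metric, so this distance, the function names, and the toy feature values are illustrative assumptions rather than the method's actual components.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two feature vectors.
    # Assumption: a stand-in for the learned similarity metric.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_voxels(prev_features, curr_features):
    # For each voxel feature at the previous time step, return the index
    # of the closest voxel feature at the current time step.
    matches = []
    for f_prev in prev_features:
        best = min(
            range(len(curr_features)),
            key=lambda j: squared_distance(f_prev, curr_features[j]),
        )
        matches.append(best)
    return matches


# Toy 2-D features (hypothetical values; in practice these would be
# CNN outputs computed from RGB-D voxel patches).
prev_features = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
curr_features = [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]]

print(match_voxels(prev_features, curr_features))  # prints [1, 0, 2]
```

This brute-force search compares every previous voxel with every current voxel, which is adequate for a small illustration; a spatial or approximate nearest-neighbor index would likely be preferable for large voxel sets.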