\section{Experiment}
We fine-tune a pretrained model, the BVLC CaffeNet model. CaffeNet is a modified version of AlexNet, and the released model is the result of training CaffeNet on ImageNet. We use the result of this fine-tuning as our baseline.

We use the Clickture-FilteredDog dataset from Microsoft, a subset of the Clickture-Full dataset that contains only dog-breed-related items. From this subset we select the 107 classes that each contain more than 100 images, for a total of 89,910 images. We split the dataset with a 5-fold partition: 71,932 images for training and 17,978 for testing.
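
As a quick consistency check on these counts (assuming that one of the five folds is held out for testing), the two parts add up to the full set and the test part is close to one fifth of it:
\[
71{,}932 + 17{,}978 = 89{,}910, \qquad \frac{17{,}978}{89{,}910} \approx 0.200 .
\]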

Our results on Clickture-FilteredDog are shown in Table 1 and Table 2. Our network achieves an accuracy of \textbf{50.5\%}, whereas the best performance with fine-tuning alone is 46.2\%.

In Table 1, the result of our first approach, the average vector, does not exceed the fine-tuning baseline. Our MMD loss does not decrease after 7000? iterations, so we believe that averaging discards some of the information in the text, which leaves the performance no better than fine-tuning. From the t-SNE visualization, we observe that some different dog breeds are clustered together, so we conclude that the average-vector feature loses useful textual information because the word2vec vectors are averaged.
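
To make this discussion concrete, the average-vector feature and the role of the MMD weight can be sketched as follows. The notation here is only an assumption introduced for illustration: $w_1,\dots,w_n$ denote the word2vec vectors of the $n$ words in the text attached to an image, $\mathcal{L}_{\mathrm{cls}}$ the classification loss, $\mathcal{L}_{\mathrm{MMD}}$ the MMD loss, and $\lambda$ the MMD weight listed in the table headers. We also assume that the two losses are combined additively.
\[
\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\,\mathcal{L}_{\mathrm{MMD}} .
\]
Under this reading, texts made of different words can map to nearly the same $\bar{w}$, which is one way in which averaging can discard information.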

In Table 2, most of our results surpass the baseline. Networks using the VLAD feature improve on the baseline by about 4 percentage points in total. The VLAD feature gives our network a notable improvement: it corrects 12\% of the wrong predictions made by the average-vector method, while 8\% of that method's correct predictions become wrong. %The loss of MMD can be lower than average vector method, it reflect that those images with dissimilar text will be pull away.


\begin{table}[]
	\centering
	\resizebox{!}{!}{
	\begin{tabular}{|l|l|l|l|l|}
		\hline
		Text Feature & 0.25*  & 0.1*   & 0.25   & 0.1    \\ \hline
		baseline     & 46.2\% & 46.2\% & 46.2\% & 46.2\% \\ \hline
		avg-vec      &        &        &        &        \\ \hline
	\end{tabular} }
	\caption{The column headers give the weight of the MMD loss. Columns marked with an asterisk (*) report the network without fc\_adapt.}
	\label{my-label}
\end{table}

\begin{table}[]
	\centering
	\resizebox{88mm}{!}{
	\begin{tabular}{|l|l|l|l|l|l|l|l|}
		\hline
		Text Feature & 0.25 & 0.1 & 0.05 & 0.01 & 0.005 & 0.001 & 0.0005   \\ \hline
		baseline        & 46.2\%   & 46.2\%  & 46.2\%   & 46.2\%   & 46.2\%    & 46.2\%    & 46.2\%       \\ \hline
		vlad-1024       & 45.8\%   & 48.0\%  & 48.4\%   & 49.6\%   & 49.7\%    & 49.9\%    &              \\ \hline
		vlad-2048       & 43.0\%   & 47.4\%  & 48.8\%   & 50.0\%   & 50.0\%    & 50.2\%    &              \\ \hline
		vlad-4000       &          &         & 48.2\%   & 49.9\%   & 50.1\%    & 50.1\%    & 50.2\%       \\ \hline
		vlad-2048* & 43.1\%   & 47.6\%  & 48.3\%   & 50.0\%   & 50.0\%    & 50.2\%    &              \\ \hline
		vlad-4000* &          &         & 48.4\%   & 50.0\%   & 50.0\%    & 50.3\%    & \textbf{50.5\%} \\ \hline
	\end{tabular} }
	\caption{The column headers give the weight of the MMD loss. Text features marked with an asterisk (*) use ``local VLAD''.}
\end{table}