Build a state-of-the-art image search engine in 5 easy steps
Earlier this summer, like many NYC residents do every year, I was looking for a new apartment. It is a daunting task for anyone: browsing through thousands of apartment listings for hours, trying to find the ones that appeal to me most. I kept wishing there were a way to upload a picture from my dream-apartment board and have it return the most similar apartments available. It turns out this is possible by building an image search engine!
Gathering the data was the first step. Using this web scraper code, I obtained the data and images for all the apartments available in Midtown Manhattan on Apartments.com. Once I had the data and images, the fun began. In the next few steps, I will show you how to build an image search engine in Python.
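The scraper itself is linked above and out of scope here, but the final download step boils down to saving each listing photo into the images folder. Here is a rough sketch with placeholder URLs (not the original scraper code):
import os
import requests

# placeholder URLs; in practice these come from the scraped Apartments.com listing pages
image_urls = [
    "https://example.com/listing_photo_1.jpg",
    "https://example.com/listing_photo_2.jpg",
]
out_dir = r"C:\Users\MarianaMaroto\Desktop\AIProject\apartments-model\images"
os.makedirs(out_dir, exist_ok=True)

for i, url in enumerate(image_urls):
    # download each photo and save it as a .jpg so the later steps can find it
    response = requests.get(url, timeout=30)
    with open(os.path.join(out_dir, f"apartment_{i}.jpg"), "wb") as f:
        f.write(response.content)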
As always, we start by importing all the relevant libraries.
1. Import Libraries
# for loading/processing the images
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from PIL import Image

# models
from keras.applications.vgg16 import VGG16
from keras.models import Model

# for everything else
import os
import numpy as np
import matplotlib.pyplot as plt
from random import randint
import pandas as pd
import pickle
2. Load Your Images
path = r"C:\Users\MarianaMaroto\Desktop\AIProject\apartments-model\images"

# change the working directory to the path where the images are located
os.chdir(path)

# this list holds all the image filenames
apartments = []

# creates a ScandirIterator aliased as files
with os.scandir(path) as files:
    # loops through each file in the directory
    for file in files:
        if file.name.endswith('.jpg'):
            # adds only the image files to the apartments list
            apartments.append(file.name)
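A quick print confirms how many listing photos were picked up:
# sanity check: how many .jpg files were found in the folder
print(f"Found {len(apartments)} apartment images")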
3. Load the VGG16 Convolutional Network
VGG16 is a deep convolutional neural network, considered one of the best-performing architectures for image recognition. It is slow to train from scratch, but here we simply load the pretrained weights and keep the second-to-last layer as the output, which turns each image into a 4,096-dimensional feature vector.
# load the model and keep the second-to-last layer as the output
model = VGG16()
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

def extract_features(file, model):
    # load the image as a 224x224 array
    img = load_img(file, target_size=(224, 224))
    # convert from 'PIL.Image.Image' to a numpy array
    img = np.array(img)
    # reshape the data for the model: (num_of_samples, dim1, dim2, channels)
    reshaped_img = img.reshape(1, 224, 224, 3)
    # prepare the image for the VGG16 model
    imgx = preprocess_input(reshaped_img)
    # get the feature vector
    features = model.predict(imgx, use_multiprocessing=True)
    return features
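A quick sanity check on a single image confirms the extractor returns a 4,096-dimensional vector (using the first filename from the apartments list as an example):
# features for one apartment image
sample = extract_features(apartments[0], model)
print(sample.shape)   # expected: (1, 4096)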
4. Feature Extraction
Go grab yourself a cup of coffee… this step takes a while depending on the number of images you are working with.
data = {}
p = r"C:\Users\MarianaMaroto\Desktop\AIProject\apartments-model\apartments_features.pkl"

# loop through each image in the dataset
for apt in apartments:
    feat = extract_features(apt, model)
    data[apt] = feat

# get a list of the filenames
filenames = np.array(list(data.keys()))

# get a list of just the features
feat = np.array(list(data.values()))
feat.shape   # (len(apartments), 1, 4096)

# reshape so that there are lots of samples of 4096-dimensional vectors
feat = feat.reshape(-1, 4096)
feat.shape   # (len(apartments), 4096)
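Since feature extraction is the slow part, it is worth caching the results with pickle. Here is a minimal sketch using the p path defined above (the caching step itself is not shown in the original code):
# save the features dictionary so extraction only has to run once
with open(p, 'wb') as f:
    pickle.dump(data, f)

# later, reload it instead of recomputing
with open(p, 'rb') as f:
    data = pickle.load(f)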
5. Use your image search engine!
In the following code, you specify the file path of your query image and the number of most similar results you want to retrieve (n_results). The loop then returns the closest images based on the Euclidean distance between feature vectors.
# Insert the image query
img = r"C:\Users\MarianaMaroto\Desktop\AIProject\apartments-model\test_images\view.jpg"
# number of most similar results to return
n_results = 5

plt.axis('off')
plt.imshow(Image.open(img))
plt.show()

# Extract its features
query = extract_features(img, model)

# Calculate the similarity (distance) between the query and every image
dists = np.linalg.norm(feat - query, axis=1)

# Extract the images with the lowest distance
ids = np.argsort(dists)[:n_results]
scores = [(dists[i], apartments[i]) for i in ids]

# Visualize the results
axes = []
fig = plt.figure(figsize=(8, 8))
for a in range(n_results):
    score = scores[a]
    axes.append(fig.add_subplot(1, n_results, a + 1))
    subplot_title = str(score[0])
    axes[-1].set_title(subplot_title)
    plt.axis('off')
    plt.imshow(Image.open(score[1]))
fig.tight_layout()
plt.show()
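If you plan to run many queries, the steps above can be wrapped into a small helper. Here is a sketch reusing the variables defined earlier (the name search_apartments is just an illustration, not part of the original code):
def search_apartments(query_path, n_results=5):
    # extract features for the query image and rank all apartments by distance
    query = extract_features(query_path, model)
    dists = np.linalg.norm(feat - query, axis=1)
    ids = np.argsort(dists)[:n_results]
    return [(dists[i], apartments[i]) for i in ids]

# example: print the five closest listings to the query image
for dist, filename in search_apartments(img, 5):
    print(round(float(dist), 2), filename)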
As you can see, the image search engine for apartment listings works great! We are able to get results that look very similar to the query images pulled from Pinterest. This search engine could be applied to a lot more than just apartments; it all depends on the data you have available. Maybe clothing, potentially even dating apps? Be right back, going to upload a picture of James Franco.