There has been some significant progress in “deep learning”, AI, and image recognition over the past couple of years; Google, Microsoft, and Amazon each have their own service offering. But what is the service like? How useful is it?
Everyone’s having a go at making a chatbot this year (and if you’re not, perhaps you should contact me for consultancy or training!) – and although there are some great examples out there, I’ve not seen much in the e-commerce sector worth talking about.
In this article I’m going to show you a cool use case for an image recognition e-commerce chatbot via a couple of clever APIs wired together by botframework.
The Concept
Could we build a chatbot that acts as a Virtual Shop Assistant, allowing the user to upload an image of an item of clothing they’d like to buy, and have the chatbot reply with similar items that they could buy?
Is image recognition up to the task? There are a few solutions out there, including very reasonably priced services from the big players (Microsoft, Amazon, and Google).
There are plenty of affiliate networks to allow us to find matching products, and there must be an API we can leverage.
Here’s how I want to structure this concept project:
- User uploads image
- Bot processes the image to find a product
- Bot responds with the description of the product it found in the image
- Bot finds similar products for sale
- Bot returns products in carousel (with affiliate links!)
- User taps the item they would like
- Bot redirects user to that item’s website for purchase
Let’s get started!
Stage 1: Image Upload
In a previous article I explained how to receive files from the user in a botframework chatbot; we’ll implement that again here to receive the image the user has posted. Something like this should work:
public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;
    var connector = new ConnectorClient(new Uri(message.ServiceUrl));
    var image = message
        .Attachments?
        .FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));
    // if the image was passed as byte data in Content, use that
    // (Content is typed as object, so it needs the cast)
    var data = (image.Content as byte[]) ??
        // if not, download it from the ContentUrl
        // (awaited rather than blocking on .Result)
        await connector
            .HttpClient
            .GetByteArrayAsync(image.ContentUrl);
}
Obviously I’m ignoring all sanity checking in this code for brevity – e.g. “is there even an attachment”, that sort of thing..
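As a rough idea of what one of those skipped sanity checks might look like, here is a minimal sketch. `ImageAttachments.IsImage` is a hypothetical helper of my own naming, not part of botframework; it keeps the same "contains image" rule as the code above but copes with a missing content type instead of throwing.

```csharp
using System;

public static class ImageAttachments
{
    // Hypothetical helper: true when a content type such as "image/jpeg"
    // (or a bare "image") describes an image.
    // A null or empty content type is treated as "not an image".
    public static bool IsImage(string contentType)
    {
        return !string.IsNullOrWhiteSpace(contentType)
            && contentType.IndexOf("image", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

It could then replace the lambda in the attachment lookup, e.g. `FirstOrDefault(x => ImageAttachments.IsImage(x.ContentType))`, followed by a null check on the result before touching `Content` or `ContentUrl`.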
Stage 2: Product Image Recognition
Let’s investigate the Image Recognition APIs available from the big players (and a small one) and their results using a test image.
The image I’m using is for a product from Asos, since I didn’t want to just use a basic “black shirt” or something simple; this is a “Longline Roll Neck Sleeveless Nude Jumper”:
This should be interesting – let’s see how they do!
Microsoft’s Computer Vision API
The first contender comes from Microsoft’s Cognitive Services; their image recognition solution is called Computer Vision API.
The Computer Vision API returns information about visual content found in an image. You can use tagging, descriptions, and domain-specific models to identify content and label it – however, the only “domain specific model” that exists so far is “celebrities”, so not really useful for clothing.
Let’s see how it does analysing the test image:
Result: “a woman posing for a picture” with the tags “person” and “standing”.
Ooookay. Not really what we need for a product recognition concept, right? Maybe it will get better in future. I’d love to be able to train it with my own model, perhaps by uploading a large set of images and their associated product descriptions.
Summary
It certainly described the image, but did it give me the contextual accuracy I require? Nope. Even the colour breakdown isn’t particularly good. Shame.
It does have a category taxonomy which would suggest the ability to provide domain-specific information to allow for more detailed categorisation, but it doesn’t look like you can train it on a known data set; for example, your website’s product catalogue with all the associated images and descriptions.
Pricing
- Free: 5000 calls per month capped limit
- Standard: $1.50 per 1000 calls, 10 transactions per second limit
Time to move on..
Google Vision API
Google Cloud Vision API enables developers to understand the content of an image by encapsulating powerful machine learning models in an easy to use REST API. It quickly classifies images into thousands of categories (e.g., “sailboat”, “lion”, “Eiffel Tower”), detects individual objects and faces within images, and finds and reads printed words contained within images.
So how does it fare with our clothing image?
Result: “Clothing, Sleeve, Dress, Photo Shoot, Outerwear, Neck, Textile, Pattern, Collar, Beige..”
Hmm. Slightly better.
Summary
Slightly better than Microsoft’s offering, perhaps? Still not something we could pass into an e-commerce site’s search box and hope to get a similar item, though.
However, the colour breakdown is nice, and the demo site displays it well.
The full JSON response has a ridiculous amount of detail about the face; from the bounding rectangle of the face elements to the location of eyebrow edges.
It just keeps on going..!
Pricing
$1.50 per 1000 images per month, with the first 1000 per month free.
Amazon’s AWS Rekognition
With Amazon’s Rekognition, you can detect objects, scenes, and faces in images. You can also search and compare faces. Rekognition’s API enables you to quickly add sophisticated deep learning-based visual search and image classification to your applications.
Let’s see what it can do with the dress image:
Result: “Human, People, Person, Cardigan, Clothing, Sweater, Blonde, Female..”
Summary
Yeah, pretty similar to Google’s results really. Still not something we can search on, so it looks like maybe my visual product search chatbot will never get off the ground..
Pricing
$1 per 1000 images processed.
Let’s give it one last try.
CloudSight API
CloudSight’s mission is to become the global leader in image captioning and understanding. You can make things more discoverable for your e-commerce site or marketplace through augmented product and image details such as brand, style, type, and more.
How does it deal with our test image?
Result: “women’s brown cowl neckline sleeveless midi dress”
Woah. That’s even more accurate than the original Asos description..!
Summary
We have a winner! But how?! If Microsoft, Google, and Amazon collectively fail to achieve such incredible product specific accuracy, how have CloudSight managed?
Of course, CloudSight don’t go around telling everyone their secret sauce, but if you follow various discussions on Reddit you’ll find several people of the opinion that it’s a mechanical turk – or at the very least a semi-mechanical one; that is, humans in the background, pretending to be a machine, possibly helping to train the underlying AI.
However, there’s also a great article from their CTO about using Amazon EC2 instances with nvidia docker images to expose the GPUs for deep learning, as well as an interesting article on Visual Cognition itself.
Given how limited in terms of scaling it would be to rely on humans for this, I’m pretty certain they’re doing the GPU-based solution.
Pricing
Either solution could certainly explain the high cost per image when compared to Microsoft, Amazon, or Google.
Wiring CloudSight into Botframework
Let’s update the MessageReceivedAsync method to pass the image data over to CloudSight for analysis:
public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;
    var connector = new ConnectorClient(new Uri(message.ServiceUrl));
    var image = message
        .Attachments?
        .FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));
    // if the image was passed as byte data in Content, use that
    // (Content is typed as object, so it needs the cast)
    var data = (image.Content as byte[]) ??
        // if not, download it from the ContentUrl
        // (awaited rather than blocking on .Result)
        await connector
            .HttpClient
            .GetByteArrayAsync(image.ContentUrl);
    // process the image
    var product = await ProcessImage(context, message, data);
}
To query the CloudSight API you need to build a specific structure for your request; the documentation and SDKs are in Python, Go, Ruby, and Objective-C, so hooking it up in C# can be a bit tricky.
Reading through their GitHub repos allowed me to reference the various other implementations and come up with a C# version.
We need quite a lot of code here since CloudSight can sometimes take a while to respond, or even time out. I’ve decided to go with just Thread.Sleep-ing between status checks for this example (await Task.Delay would be the non-blocking equivalent), but you could create an out-of-band proactive reply if you prefer, or perhaps wire in a webhook.
The key elements are:
Building the CloudSight request
var content =
    new MultipartFormDataContent("Upload----" + DateTime.Now)
    {
        {
            // the image byte data
            new StreamContent(new MemoryStream(data)),
            "image_request[image]", "image.jpg"
        },
        {
            new StringContent("en-GB"),
            "image_request[locale]"
        }
    };
var imgClient =
    new HttpClient
    {
        BaseAddress = new Uri("https://api.cloudsightapi.com/")
    };
imgClient.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("CloudSight", "<your api key goes here>");
Submit the request and check the processing status
// Send the image for processing to /image_requests
var responseMessage =
    await imgClient.PostAsync("image_requests", content);
// Get the token for this request from the response
var jsonimageresponse =
    await responseMessage.Content.ReadAsStringAsync();
// get a dynamic object using Newtonsoft.Json
dynamic imageresponse =
    JsonConvert.DeserializeObject(jsonimageresponse);
// check the image processing status using the token
// (this is a different endpoint - /image_responses)
var jsonimagestatus = await
    (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
        .Content
        .ReadAsStringAsync();
dynamic imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
// it will be in "imagestatus.status"
Check the processing status until we get a result or a timeout
// if it's not Completed or Timed Out yet, wait and poll
while (imagestatus.status != "completed" && imagestatus.status != "timeout")
{
    // wait a couple of seconds
    Thread.Sleep(2000);
    // check the status again
    jsonimagestatus = await
        (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
            .Content
            .ReadAsStringAsync();
    imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
}
Now let’s pull all of the above chunks of code together into a single method with a few “typing” responses where appropriate:
private static async Task<string> ProcessImage(IDialogContext context,
    IMessageActivity message,
    byte[] data)
{
    // build the request's content - a very specific request structure
    var content =
        new MultipartFormDataContent("Upload----" + DateTime.Now)
        {
            {
                // the image byte data
                new StreamContent(new MemoryStream(data)),
                "image_request[image]", "image.jpg"
            },
            {
                new StringContent("en-GB"),
                "image_request[locale]"
            }
        };
    var imgClient =
        new HttpClient
        {
            BaseAddress = new Uri("https://api.cloudsightapi.com/")
        };
    imgClient.DefaultRequestHeaders.Authorization =
        new AuthenticationHeaderValue("CloudSight", "<your api key goes here>");
    // Send the image for processing
    var responseMessage =
        await imgClient.PostAsync("image_requests", content);
    // Get the token for this request from the response
    var jsonimageresponse =
        await responseMessage.Content.ReadAsStringAsync();
    // get a dynamic object using Newtonsoft.Json
    dynamic imageresponse =
        JsonConvert.DeserializeObject(jsonimageresponse);
    // check the image processing status using the token
    var jsonimagestatus = await
        (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
            .Content
            .ReadAsStringAsync();
    dynamic imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
    // prepare the "typing" response..
    var typing = context.MakeMessage();
    typing.Type = ActivityTypes.Typing;
    // if it's not Completed or Timed Out yet, wait and poll
    while (imagestatus.status != "completed" && imagestatus.status != "timeout")
    {
        // not done yet, show the chatbot loading spinner..
        await context.PostAsync(typing);
        Thread.Sleep(2000);
        jsonimagestatus = await
            (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
                .Content
                .ReadAsStringAsync();
        imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
    }
    // Did it Complete or Time Out?
    // (initialised to null so the variable is assigned on the timeout path too)
    string productdescription = null;
    if (imagestatus.status != "timeout")
    {
        // Got a result!
        productdescription = imagestatus.name;
        await
            context.PostAsync($"Aha - looks like it's a {productdescription}!");
    }
    else
    {
        // Timed Out
        await
            context.PostAsync("Ah, couldn't find anything this time, sorry.");
    }
    return productdescription;
}
That should do for receiving an image, submitting it to CloudSight, and getting a product description back (or bombing out).
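One thing the polling loop in ProcessImage relies on is CloudSight eventually reporting "completed" or "timeout"; there is no deadline on our side. A minimal sketch of a bounded, non-blocking alternative follows. `Polling.PollUntilAsync` is a hypothetical helper of my own naming, and it swaps Thread.Sleep for Task.Delay; it is not something CloudSight or botframework provide.

```csharp
using System;
using System.Threading.Tasks;

public static class Polling
{
    // Hypothetical helper: calls fetch() until isDone(result) is true or the
    // overall timeout has elapsed, and returns the last result either way.
    public static async Task<T> PollUntilAsync<T>(
        Func<Task<T>> fetch,
        Func<T, bool> isDone,
        TimeSpan interval,
        TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;
        var result = await fetch();

        while (!isDone(result) && DateTime.UtcNow < deadline)
        {
            // Task.Delay releases the thread while waiting, unlike Thread.Sleep
            await Task.Delay(interval);
            result = await fetch();
        }

        return result;
    }
}
```

In ProcessImage, `fetch` would wrap the GET to image_responses/{token} and return the status, and `isDone` would check for "completed" or "timeout"; the caller then treats anything else coming back as a timeout of its own.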
Now let’s use that product description to find similar products for the user to spend their money on!
Stage 3: Product Listings
In order to take the product description from CloudSight and turn it into a purchase, I’ve gone with ShopStyle; a shopping platform with an affiliate program. By using ShopStyle I can search on many online fashion shops all at once, and even receive an affiliate sale if anyone clicks through and buys – KACHING!
If you decide to build this yourself, you’d just submit the text result from CloudSight to your own site’s search endpoint instead.
The ShopStyle API allows client applications to retrieve the underlying data for all the basic elements of the ShopStyle website, including products, brands, retailers, and categories. For ease of development, the API is a REST-style web service, composed of simple HTTP GET requests. Data is returned to the client in either XML or JSON formats.
We can hit their api endpoint for product listing with the search term and we’ll get a result that looks like this:
Should be pretty easy to extract product data from that and display a carousel, right? Let’s have a go:
Query ShopStyle
// build the shopstyle query
var shopClient = new HttpClient
{
    BaseAddress = new Uri("http://api.shopstyle.com/api/v2/")
};
var jsonproductresponse = await
    (await
        shopClient.GetAsync(
            "products?" +
            "pid=<your api key goes here>&" +
            $"fts={HttpUtility.UrlEncode(productdescription)}&" +
            "offset=0&limit=10"))
        .Content
        .ReadAsStringAsync();
// create a dynamic object from the json response
dynamic productresponse = JsonConvert.DeserializeObject(jsonproductresponse);
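If you would rather not concatenate the query string inline, the request path can be built by a small function. This is only a sketch: `ShopStyleQuery.Build` is a hypothetical name, the parameter names (pid, fts, offset, limit) are the ones used in the request above, and it uses Uri.EscapeDataString in place of HttpUtility.UrlEncode so that it needs no reference to System.Web (spaces become %20 rather than +).

```csharp
using System;

public static class ShopStyleQuery
{
    // Hypothetical helper: builds the relative "products" request path with
    // the api key and the free-text search term both escaped for a query string.
    public static string Build(string apiKey, string searchText, int offset = 0, int limit = 10)
    {
        return "products"
            + "?pid=" + Uri.EscapeDataString(apiKey ?? string.Empty)
            + "&fts=" + Uri.EscapeDataString(searchText ?? string.Empty)
            + "&offset=" + offset
            + "&limit=" + limit;
    }
}
```

For example, `ShopStyleQuery.Build("KEY", "brown midi dress")` gives `products?pid=KEY&fts=brown%20midi%20dress&offset=0&limit=10`, which can be passed straight to `shopClient.GetAsync`.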
Create a list of Hero cards for the carousel from the dynamic response object:
var productlist = new List<Attachment>();
// show a max of 5 items
int productMax =
    productresponse.metadata.total < 5 ?
    productresponse.metadata.total : 5;
for (var i = 0; i < productMax; i++)
{
    // create a link to the product as a Card Action
    var buttons = new List<CardAction>
    {
        new CardAction
        {
            Title = "View details",
            Type = "openUrl",
            Value = productresponse.products[i].clickUrl
        }
    };
    // try to get an image if there is one
    var imgs = new List<CardImage>();
    string img =
        productresponse
            .products[i]?
            .image?
            .sizes?
            .XLarge?
            .url;
    if (!string.IsNullOrEmpty(img))
    {
        imgs.Add(new CardImage(img));
    }
    // add the Card Action and the image to a Hero Card attachment
    var attachment = new HeroCard
    {
        Text = productresponse.products[i].name,
        Images = imgs,
        Subtitle = productresponse.products[i].priceLabel,
        Buttons = buttons
    };
    productlist.Add(attachment.ToAttachment());
}
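The dynamic object keeps the example short, but a typo in a property name only shows up at runtime. As an alternative, here is a sketch that pulls the same fields (name, priceLabel, clickUrl, image.sizes.XLarge.url) into a small typed class. It assumes a runtime where System.Text.Json is available and uses it in place of Newtonsoft.Json; `ProductSummary` and `ProductParser` are hypothetical names, not part of ShopStyle's API.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public sealed class ProductSummary
{
    public string Name { get; set; }
    public string PriceLabel { get; set; }
    public string ClickUrl { get; set; }
    public string ImageUrl { get; set; }   // null when there is no XLarge image
}

public static class ProductParser
{
    // Reads at most `max` entries from the "products" array of the response.
    public static List<ProductSummary> Parse(string json, int max = 5)
    {
        var results = new List<ProductSummary>();
        using (var doc = JsonDocument.Parse(json))
        {
            if (!doc.RootElement.TryGetProperty("products", out var products)
                || products.ValueKind != JsonValueKind.Array)
            {
                return results;
            }

            foreach (var p in products.EnumerateArray())
            {
                if (results.Count >= max) break;
                results.Add(new ProductSummary
                {
                    Name = GetString(p, "name"),
                    PriceLabel = GetString(p, "priceLabel"),
                    ClickUrl = GetString(p, "clickUrl"),
                    ImageUrl = GetString(p, "image", "sizes", "XLarge", "url")
                });
            }
        }
        return results;
    }

    // Walks a chain of property names; returns null if any step is missing
    // or the final value is not a string.
    private static string GetString(JsonElement element, params string[] path)
    {
        var current = element;
        foreach (var name in path)
        {
            if (current.ValueKind != JsonValueKind.Object
                || !current.TryGetProperty(name, out var next))
            {
                return null;
            }
            current = next;
        }
        return current.ValueKind == JsonValueKind.String ? current.GetString() : null;
    }
}
```

Each `ProductSummary` then maps onto a Hero Card in the same way as the loop above: Name to Text, PriceLabel to Subtitle, ClickUrl to the button, and ImageUrl to the card image when it is not null.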
Respond with a carousel:
// create the carousel from the product list
var carousel = context.MakeMessage();
carousel.Attachments = productlist;
carousel.AttachmentLayout = AttachmentLayoutTypes.Carousel; //or List
carousel.Text = "Similar products - tap to buy";
await context.PostAsync(carousel);
Let’s pull that all together into a single method with some extra messages to the user:
private static async Task ShowProductListing(IDialogContext context, string productdescription)
{
    await context.PostAsync("Now looking for similar products - brb!");
    // build the shopstyle query
    var shopClient = new HttpClient
    {
        BaseAddress = new Uri("http://api.shopstyle.com/api/v2/")
    };
    var jsonproductresponse = await
        (await
            shopClient.GetAsync(
                "products?" +
                "pid=<your api key goes here>&" +
                $"fts={HttpUtility.UrlEncode(productdescription)}&" +
                "offset=0&limit=10"))
            .Content
            .ReadAsStringAsync();
    // create a dynamic object from the json response
    dynamic productresponse
        = JsonConvert.DeserializeObject(jsonproductresponse);
    // did we find any results?
    if (productresponse.metadata.total > 0)
    {
        await
            context.PostAsync($"I found {productresponse.metadata.total} items!");
        var productlist = new List<Attachment>();
        // show a max of 5 items
        int productMax =
            productresponse.metadata.total < 5 ?
            productresponse.metadata.total : 5;
        for (var i = 0; i < productMax; i++)
        {
            // create a link to the product as a Card Action
            var buttons = new List<CardAction>
            {
                new CardAction
                {
                    Title = "View details",
                    Type = "openUrl",
                    Value = productresponse.products[i].clickUrl
                }
            };
            // try to get an image if there is one
            var imgs = new List<CardImage>();
            string img = productresponse
                .products[i]?
                .image?
                .sizes?
                .XLarge?
                .url;
            if (!string.IsNullOrEmpty(img))
            {
                imgs.Add(new CardImage(img));
            }
            // add the Card Action and the image to a Hero Card attachment
            var attachment = new HeroCard
            {
                Text = productresponse.products[i].name,
                Images = imgs,
                Subtitle = productresponse.products[i].priceLabel,
                Buttons = buttons
            };
            productlist.Add(attachment.ToAttachment());
        }
        // create the carousel from the product list
        var carousel = context.MakeMessage();
        carousel.Attachments = productlist;
        carousel.AttachmentLayout = AttachmentLayoutTypes.Carousel; //or List
        carousel.Text = "Similar products - tap to buy";
        await context.PostAsync(carousel);
    }
    else
    {
        await
            context.PostAsync("Sorry, didn't find anything. Try again?");
    }
}
Stage 4: Wiring it up
Now that I’ve managed to find an image recognition solution that’s surprisingly accurate, and a fashion e-commerce api that takes an arbitrary search term, connecting them together should give us a basic bot that can take an input image and return products in a nice carousel to browse and buy.
This entire proof of concept consists of just 3 methods: MessageReceivedAsync, ProcessImage, and ShowProductListing.
In the main MessageReceivedAsync method I’ve added in a little sanity checking and the ability to bypass image recognition if there’s no image and the message contains some text.
public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;
    // default to message text, so we can bypass the image recognition
    var product = message.Text;
    // is there an image in the message?
    var image = message
        .Attachments?
        .FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));
    // if so, fire off image recognition
    if (image != null)
    {
        var connector =
            new ConnectorClient(new Uri(message.ServiceUrl));
        // if the image was passed as byte data in Content, use that
        var data = (image.Content as byte[]) ??
            // if not, download it from the ContentUrl
            // (awaited rather than blocking on .Result)
            await connector
                .HttpClient
                .GetByteArrayAsync(image.ContentUrl);
        product = await ProcessImage(context, message, data);
    }
    // find matching products and display them
    await ShowProductListing(context, product);
    context.Wait(MessageReceivedAsync);
}
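One case the method above still lets through is an empty search term: image recognition may not produce a description if CloudSight times out, and a message can arrive with neither an image nor any text. A minimal sketch of a guard, using a hypothetical `SearchTerms.Resolve` helper of my own naming:

```csharp
public static class SearchTerms
{
    // Hypothetical helper: prefer the recognised description, fall back to
    // the message text, and return null when neither is usable.
    public static string Resolve(string recognised, string messageText)
    {
        if (!string.IsNullOrWhiteSpace(recognised)) return recognised.Trim();
        if (!string.IsNullOrWhiteSpace(messageText)) return messageText.Trim();
        return null;
    }
}
```

MessageReceivedAsync could call this just before ShowProductListing and, when it returns null, ask the user for a photo or a description instead of searching for nothing.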
The End Result
Summary
As you can see, product specific image recognition can be exceptional in some cases; pretty much spot on for this proof of concept. The image description is so specific that in some cases it means the product search doesn’t find anything; this is more a failure of the product search functionality though.
Unfortunately, it’s quite cost-prohibitive – just to break even with CloudSight I’d need 7,500 affiliate clicks each month, each earning a minimum of 4 cents (7,500 × $0.04 = $300 a month). That doesn’t take into account hosting costs (receiving and sending the image data around could be charged depending on your hosting solution).
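The break-even sum is easy to rerun with your own numbers. In this sketch the $300 monthly cost is my inference from the figures above (7,500 clicks at 4 cents each), not a quoted CloudSight price, and `BreakEven.ClicksNeeded` is a hypothetical helper.

```csharp
using System;

public static class BreakEven
{
    // Clicks needed per month for affiliate earnings to cover the API bill,
    // rounded up to a whole click.
    public static int ClicksNeeded(decimal monthlyCost, decimal earnedPerClick)
    {
        if (earnedPerClick <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(earnedPerClick));
        }
        return (int)Math.Ceiling(monthlyCost / earnedPerClick);
    }
}
```

`BreakEven.ClicksNeeded(300m, 0.04m)` returns 7500, matching the figure above; doubling the earnings per click to 8 cents halves it to 3750.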
Maybe the ability to “train” an image recognition solution using a seed set of image data, such as your own product catalogue, would help. It certainly would be interesting to try. (*Ahem* Any forward-thinking e-commerce companies out there willing to pay me to do this, please get in contact!)
But why do this as a chatbot? Why not a web page? Anecdotally, I’ve found people happier to upload an image in a chat than submit it to a webpage. Also, I personally feel that the asynchronous nature of a conversational style allows for a more natural interface, in this particular scenario. It would be interesting to try both approaches out and user test them.
So where to go next with this?
In this particular concept, I’d look to improve the conversation flow; e.g. not rely on just an image but also ask for more information from the user if possible. Maybe I’ll put it in the wild and see how many people click through and possibly purchase in a month.
Some CloudSight coolness to end with
I’ll leave you with these CloudSight responses that just blew me away; how on earth did it come up with this for another few of my test images of friends’ clothing? It makes me think that CloudSight is actually using humans after all!
Wat?!
That is a Mammut beanie..
Stef is wearing a Harley Quinn tee under a black zip-up hoodie. This is crazy.
What do YOU think CloudSight are using?.. I’d love to hear theories..
this code is in which language java for android or ios?
This is all C#
Hi. Thank you for sharing. The content is very useful. Can I access the project source? I need this for an academic study
User uploads image
Bot processes the image to find a product
Bot responds with the description of the product it found in the image
Bot finds similar products for sale
Bot returns products in carousel (with affiliate links!)
User taps the item they would like
Bot redirects user to that item’s website for purchase
Seems like the “Bot” in here is a website that calls an API. How is this a bot? I mean how can a bot redirect a user?
> Bot redirects user to that item’s website for purchase
By “redirect” I mean “opens web page”… *ahem*
Everything up until that point is the bot.