Since I posted about Amazon’s new Go store in downtown SF, I’ve had a lot of spirited discussions about whether it is using AI or not, how it’s done, and whether we’ve foolishly traded cashiers for Skynet. Since it’s a rainy Saturday morning in SF, and I’m a little hung over, I decided to veg on the couch and noodle on how I think they are doing it.
Before we get started… I have worked with Amazon Deep Learning (DL) Engineers, I know Amazon DL Engineers, Amazon DL Engineers are friends of mine, and I am no Amazon DL Engineer… They are much more clever than I am, and probably have an infinitely cooler way of doing this. Nevertheless, here’s my take.
Building Go involves two main problems. The first is tracking a unique customer inside the store. The second is tracking the location of each individual item to be purchased, the SKU. You might consider correlating the customer to the SKU a third problem, but if you solve the first two properly, you get the correlation for free.
It’s also important to consider the assumptions used when designing Go. Amazon has confirmed they aren’t using RFID, and the store shelving (the refrigerated stuff) precludes weight sensors, which leads me to believe they are using a computer vision approach. This means dozens of cameras in the store, watching everything going on.
The most important design criterion is that Amazon gets to control and meticulously document every single thing going into the Go store, including the people. This allows them to ruthlessly constrain the problem and deliver a delightful experience. As I began to understand this meticulous control of the environment, and peel back the layers, I was floored by its ingenuity.
Let’s start with the store itself. Below is a snapshot of a couple of items I bought this week in Go.
Notice anything interesting about them (besides the fact that I ate the bacon sandwich; hungover, remember)? Similar products have differing, high-contrast coloring on their labels. I bought a bacon sandwich that was right next to a turkey sausage sandwich, but the labels were starkly different colors. As I thought about this, it became apparent that it’s a very clever physical implementation of Orthogonal Frequency Division Multiplexing (OFDM).
If you treat each product as a subcarrier frequency, then by placing orthogonal signals next to each other they can minimize the “cross talk” in their model. Even the bags they offer in the store are a high-contrast orange, which would let them identify the bag with high accuracy and boost several other signals. I’m sure this contributes a few points of accuracy to their models. They might even be able to get away with a form of boosted, scale-invariant template matching instead of fully convolutional neural networks (CNNs), though they probably still use CNNs.
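To make the OFDM analogy concrete, here’s a minimal sketch of what “orthogonality” between neighboring labels could mean in practice: measure the color distance between adjacent products on a shelf and flag layouts where the worst-case pair is too close. The RGB values and the shelf itself are entirely hypothetical; a real system would work in a perceptual color space and on full label images, not single colors.

```python
# Sketch: treat each label's dominant color as a "signal" and check how
# separable adjacent labels are. All color values here are made up.
import math

def color_distance(c1, c2):
    """Euclidean distance between two RGB colors (0-255 per channel)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def min_adjacent_distance(shelf):
    """Smallest color distance between any two neighboring labels on a shelf."""
    return min(color_distance(a, b) for a, b in zip(shelf, shelf[1:]))

# Hypothetical shelf: bacon sandwich (red label) next to
# turkey sausage (blue), next to a third item (orange).
shelf = [(200, 30, 40), (30, 60, 200), (240, 150, 20)]
print(round(min_adjacent_distance(shelf)))  # worst-case pair separation
```

A planogram tool could maximize this minimum distance the same way OFDM spaces subcarriers: the bigger the worst-case gap, the less cross talk between neighboring “channels.”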
I’d guess that Amazon has a planning system that helps them design the layout of the store, so that items whose labels they can’t control are placed orthogonally, and where they can control the label, they automatically design one that is highly orthogonal. This would all feed into a sophisticated image registration system covering every product sold in a Go store.
Imagine how accurate a CNN could be if you could take multiple pictures, with varying entropy, of every single item the object detector would need to find, and train it exclusively on those images, with an integrated ablation study for the augmentation. I’d bet that Amazon has a rig that takes a 3-dimensional image of each item, augments it nine ways from Sunday, and then trains the model on this engineered and augmented dataset. With such a highly optimized dataset, you could use a You Only Look Once (YOLO) type model and address the SKU tracking problem handily. Pretty nifty if you ask me.
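As a toy illustration of that augmentation rig, here’s a sketch that spins a single registered product image into several training variants via flips and brightness shifts. The “image” is just a grayscale grid of numbers; a real pipeline would use a proper imaging library and 3D captures, and the specific transforms are my guess, not anything Amazon has described.

```python
# Sketch: one captured product image becomes many training variants.
# The image here is a toy grayscale grid (lists of 0-255 pixel values).

def flip_horizontal(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def adjust_brightness(img, factor):
    """Scale every pixel, clamping at the 255 ceiling."""
    return [[min(255, int(px * factor)) for px in row] for row in img]

def augment(img):
    """Return a list of augmented variants of one product capture."""
    variants = [img, flip_horizontal(img)]
    # Darker and brighter copies of the two base orientations.
    for factor in (0.8, 1.2):
        variants += [adjust_brightness(v, factor) for v in variants[:2]]
    return variants

base = [[10, 200], [60, 120]]
print(len(augment(base)))  # 6 variants from a single capture
```

Scale that idea up (rotations, lighting, occlusion, backgrounds) and one studio capture per SKU turns into thousands of labeled training examples, which is exactly the kind of dataset a YOLO-style detector thrives on.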
Which brings us to Buffy…
Amazon has confirmed that they aren’t using facial recognition in the Go store. No surprise there; it’s not the right tool. To understand how they are solving this, you have to look at how you enter the store. Amazon has a Go app that you open before entering the store, which displays a barcode. When you enter the store, you walk through a turnstile and scan the barcode on your phone to unlock it.
Let’s look at all the control points in this “simple” walk in. Amazon is able to control exactly where you are and how you walk into the store. They control the height (down to the millimeter) of the turnstile and of the scanner where you place your phone to enter. Basically, they are “forcing” you into a predefined pose to register you in a multi-stage tracking system.
By controlling the entry point and pose, they can accurately measure your height, gait, and joint distances. This is far more accurate (especially in this controlled setting) than anything they could do with facial recognition.
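Here’s why a fixed-height turnstile is such a gift to the vision system: any object of known physical size in the frame calibrates the pixel scale. A minimal single-camera sketch, assuming (purely for illustration) that the scanner is 1050 mm tall and the person stands at roughly the same depth as it:

```python
# Sketch: use the turnstile scanner, whose height Amazon controls to the
# millimeter, as a pixel-scale reference. All numbers are hypothetical.

SCANNER_HEIGHT_MM = 1050  # assumed physical height of the scanner

def estimate_height_mm(person_px, scanner_px):
    """Estimate a customer's height from pixel spans in one frame.
    Assumes person and scanner are at roughly the same camera depth."""
    mm_per_px = SCANNER_HEIGHT_MM / scanner_px
    return person_px * mm_per_px

# Person spans 620 px in the frame; the scanner spans 350 px.
print(round(estimate_height_mm(620, 350)))  # 1860 mm, about 6'1"
```

With multiple calibrated cameras and a forced pose, the same trick extends from height to limb lengths and joint distances, which is the biometric signature the tracking system can then follow around the store.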
The first thing they probably do with this is pose estimation. Back in 2008, the University of Oxford released the Stickmen dataset for human pose estimation, based on 4 episodes of Buffy the Vampire Slayer. Truth be told, this built on Madonna’s groundbreaking “Strike a Pose” dataset from 2005 (don’t @ me). That was a decade ago; to see what the state of the art in pose estimation looks like now, take a look at CMU’s work.
Or go deep with Facebook’s DensePose.
If you want to really freak yourself out, check out these real-time experiments based on Google’s PoseNet: Move Mirror and the Tracking Demo.
Accurate pose estimation opens up a world of possibilities. It’s a cakewalk to track a unique user around the store (remember, they control the store layout and camera positions: no blind spots), but even more importantly, they can use the poses themselves to boost the accuracy of the SKU tracking. By understanding gestures, they can tell when a customer reaches for something on the shelf, places something in their bag, or puts it back on the shelf (even in a different position). If you take something off the shelf, they could analyze the video feed from 5 seconds before and 5 seconds after, cross-reference that with known stock levels, and make an even more accurate decision about what was pulled off the shelf.
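That ±5 second cross-check can be sketched in a few lines: when a reach gesture fires, gather the shelf observations near that moment and intersect them with the SKUs whose stock actually dropped. The event shapes and SKU names below are invented for illustration; the real system would be fusing many camera streams, not a flat list of dicts.

```python
# Sketch: correlate a detected "reach" gesture with nearby camera
# observations and known stock deltas. Data formats are hypothetical.

def frames_in_window(frames, event_time, window=5.0):
    """Return observations whose timestamps fall within +/- window seconds."""
    return [f for f in frames if abs(f["t"] - event_time) <= window]

def infer_sku(frames, stock_before, stock_after, event_time):
    """Pick the single SKU that was both seen near the event and
    actually dropped in stock; return None if it's ambiguous."""
    seen = {f["sku"] for f in frames_in_window(frames, event_time)}
    dropped = {sku for sku in stock_before
               if stock_after.get(sku, 0) < stock_before[sku]}
    candidates = seen & dropped
    return candidates.pop() if len(candidates) == 1 else None

frames = [{"t": 9.0, "sku": "bacon-sandwich"}, {"t": 30.0, "sku": "red-bull"}]
before = {"bacon-sandwich": 4, "red-bull": 12}
after = {"bacon-sandwich": 3, "red-bull": 12}
print(infer_sku(frames, before, after, event_time=10.0))  # bacon-sandwich
```

The point of returning None on ambiguity is that each signal (gesture, detection, stock delta) is individually noisy; it’s the intersection of independent signals that makes the final decision trustworthy.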
Combining control of the environment, a sophisticated image registration system, object detection, and pose estimation allows them to deliver a truly magical experience. And then it dawns on me… Go is an exercise in control, or, by another name, supervision. By controlling all aspects of the shop (and varying them) they are creating a labeled dataset, perfect for supervised learning. Go is a Lab.
It would undoubtedly have been easier to solve this problem with some combination of smart shelves, RFID, scales, and so on. But it’s often more interesting to understand how people didn’t do something than how they did. Amazon knows that if they were building the stores themselves, they could control all aspects of the design and use these easier solutions. But they also know that as of 2017, Amazon Marketplace (third-party sellers) sold more on Amazon than Amazon itself did.
Using Go as a lab, they’re creating an Amazon Marketplace for third-party physical retailers. Imagine being able to ship a box of a couple dozen Wi-Fi enabled cameras out to a mom-and-pop shop in Saskatchewan. An installer, Comcast-style, comes out and mounts the cameras around the store. The chance of the retailer selling something Amazon doesn’t have pictures of is low, so image registration could be bootstrapped. You could basically convert a small store into a cashierless, state-of-the-art retail experience in a weekend. Or maybe I’m wrong and Go stores are just there to sell more Red Bull. Either way, pretty cool tech.