The other day when I was sitting in my home office, I got an alert from my Nest Doorbell that a package had been delivered — and right from my phone, I could see it sitting on the porch. Moments later, my neighbor dropped by to return a piece of mail that had accidentally gone to her — and again, my Doorbell alerted me. But this time, it alerted me that someone (rather than something) was at the door.
When I opened my door and saw my neighbor standing next to the package, I wondered…how does that little camera understand the world around it?
For an answer, I turned to Yoni Ben-Meshulam, a Staff Software Engineer who works on the Nest team.
Before I ask how the camera tells a person from a vehicle, how does it detect anything at all?
Our cameras run something called a perception algorithm which detects objects (people, animals, vehicles, and packages) that show up in the live video stream. For example, if a package is delivered within one of your Activity Zones, like your porch, the camera will track the movement of the delivery person and the package, and analyze all of this to give you a package delivery notification. If you have Familiar Face Alerts on and the camera detects a face, it analyzes the face on-device and checks whether it matches anyone you have identified as a Familiar Face. And the camera recognizes new faces as you identify and label them.
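To make the Activity Zone idea concrete, here is a minimal sketch of how a detection could be checked against a zone. It is an illustration under stated assumptions, not how Nest cameras actually implement it: the `Box` and `Detection` types, the rectangular zone, and the "center point inside the zone" rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def center(self):
        # Midpoint of the rectangle as an (x, y) pair.
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

    def contains(self, point):
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Detection:
    """One detected object: a label plus its bounding box (hypothetical type)."""
    label: str
    box: Box


def detections_in_zone(detections, zone):
    # Assumed rule: a detection counts as "in the zone" when the center
    # of its bounding box falls inside the zone rectangle.
    return [d for d in detections if zone.contains(d.box.center())]


# Made-up coordinates for a porch zone and two detections in one frame.
porch = Box(100, 200, 500, 480)
frame = [
    Detection("package", Box(250, 380, 330, 450)),
    Detection("vehicle", Box(600, 100, 900, 300)),
]
print([d.label for d in detections_in_zone(frame, porch)])  # ['package']
```

Only the package lands inside the porch zone here, so only it would be a candidate for a notification. A real system would presumably also track objects across frames before deciding to alert, as the answer above describes.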
The camera also learns what its surroundings look like. For example, if you have a Nest Cam in your living room, the camera runs an algorithm that can identify where there is likely a TV, so that the camera won’t think the people on the screen are in your home.
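One simple way to picture the TV example is as a filter that discards person detections lying almost entirely inside a known screen region. The sketch below assumes the TV's location is already available as a rectangle and uses a made-up 90% containment threshold. The function names and the rule itself are hypothetical, and the interview does not say how Nest actually handles this.

```python
def area(box):
    # box is (left, top, right, bottom); degenerate boxes have zero area.
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def fraction_inside(box, region):
    # Share of the box's area that overlaps the region, from 0.0 to 1.0.
    overlap = (
        max(box[0], region[0]),
        max(box[1], region[1]),
        min(box[2], region[2]),
        min(box[3], region[3]),
    )
    box_area = area(box)
    return area(overlap) / box_area if box_area > 0 else 0.0


def drop_screen_detections(boxes, tv_region, max_fraction=0.9):
    # Assumed rule: a box that sits almost entirely within the TV region
    # is treated as on-screen content and removed.
    return [b for b in boxes if fraction_inside(b, tv_region) < max_fraction]


# Made-up coordinates: one "person" on the TV, one standing in the room.
tv_region = (400, 100, 800, 350)
person_boxes = [(450, 120, 550, 340), (50, 150, 200, 470)]
kept = drop_screen_detections(person_boxes, tv_region)
print(kept)  # [(50, 150, 200, 470)]
```

Using a containment fraction rather than a strict "fully inside" test leaves room for slightly loose bounding boxes, though the right threshold would depend on the detector.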
Perception algorithms sound a little like machine learning. Is ML involved in this process?
Yes. Nest cameras actually have multiple machine learning models running inside of them. One is an object detector that takes in video frames and outputs a bounding box around objects of interest, like a package or vehicle. This object detector was trained on millions of examples to solve this problem.
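As a rough picture of what "outputs a bounding box around objects of interest" can look like in code, the sketch below post-processes one frame's worth of detector output. The `RawDetection` record, the confidence `score` field, and the 0.5 threshold are assumptions made for illustration. Object detectors commonly attach a score to each box, but the interview does not describe the format Nest uses.

```python
from typing import NamedTuple


class RawDetection(NamedTuple):
    """One box emitted by a detector for a single frame (hypothetical format)."""
    label: str
    score: float  # assumed confidence in [0, 1]
    box: tuple    # (left, top, right, bottom) in pixels


# The four object types named in the interview.
LABELS_OF_INTEREST = {"person", "animal", "vehicle", "package"}


def filter_detections(raw, min_score=0.5):
    # Keep boxes with a label of interest and enough confidence,
    # ordered from most to least confident.
    kept = [
        d for d in raw
        if d.label in LABELS_OF_INTEREST and d.score >= min_score
    ]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up output for one frame.
raw_output = [
    RawDetection("package", 0.91, (250, 380, 330, 450)),
    RawDetection("person", 0.32, (10, 40, 90, 300)),
    RawDetection("vehicle", 0.77, (600, 100, 900, 300)),
]
confident = filter_detections(raw_output)
print([d.label for d in confident])  # ['package', 'vehicle']
```

The low-scoring person box is dropped, which is the kind of trade-off a threshold controls: raising it cuts false alerts, and lowering it catches more real events.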