Using a machine-learning model in your iOS app

By Will Braynen

Illustration by Donna Blumenfeld

I've been following machine learning on and off since the early 2000s, ever since my friend Kat got seriously into it, thinking of it as formal epistemology that is sometimes cognitively informed (and I dig epistemology and cognitively informed stuff).  Even so, I wanted to avoid talking about machine learning on this blog because of how much of a buzzword it has suddenly become.  From Apple's WWDC announcements, I got the impression that there is now machine learning in my coffee.  But iOS 11 is scheduled for release tomorrow, September 19, 2017, so I guess it's time to talk about it.

So, earlier this month, while back in New York and as an appetizer for try! Swift, a professional community conference organized by Natasha the Robot (Natasha Murashev in real life), I sat in on Meghan Kane's wonderful workshop.  Taking Meghan's git repo and filling in a few things, I tried out a few freely available machine-learning (ML) models in the iOS Simulator.  What impressed me was how little actual machine learning someone has to understand to use an ML model in their app.  Apple did a great job of black-boxing the real complexity away from the iOS developer, making ML extremely accessible to any good, properly motivated iOS dev.

Below I share some anecdotal results of using an already-trained model in an iOS app, as well as some engineering considerations.  And if there is interest in the code, I would be happy to share some of it in a future post.

 

what's a model

To avoid confusion early on: by "model", in this article I do not mean the M in MVC or MVVM.  I mean a black box that's used to classify things, to make guesses about their non-discrete properties, or to predict other things correlated with them.

This black box has already seen lots of data before you got your hands on it and was trained on that data using some ML algorithm.  A model, which you can now use to make sense of the world, is the result of that training process.

The model might take some data as input and output a label.  The input to the model might be an image.  The output might be the model's guess about what's in the image, for example: "person", "dog", "house", "outhouse".  Or it might be a guess about what emotion a face expresses, like "happiness" (or "happy" below), "sadness" (or "sad" below), "anger" ("angry" below), "fear", "disgust", or "neutral".  The output might even be a bounding box; if a bounding box is returned, then a face was detected, and a bounding box is certainly more informative than the labels "face"/"no face".
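To make that concrete, here is a minimal sketch of what asking a model for a label can look like in Swift with Core ML and Vision.  The EmotionClassifier class is hypothetical: it stands in for whatever .mlmodel file you have added to your Xcode project (Xcode generates a Swift class of the same name for you).

```swift
import UIKit
import CoreML
import Vision

// A minimal sketch, assuming a hypothetical EmotionClassifier.mlmodel has been
// added to the Xcode project (Xcode generates the EmotionClassifier class).
func classifyEmotion(in image: UIImage, completion: @escaping (String?) -> Void) {
    guard let cgImage = image.cgImage,
          let visionModel = try? VNCoreMLModel(for: EmotionClassifier().model) else {
        completion(nil)
        return
    }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // The model's top guess is a label such as "happy" or "sad".
        let topResult = (request.results as? [VNClassificationObservation])?.first
        completion(topResult?.identifier)
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```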

A model is really a hypothesis about the real world.  It is a representation.  One set of real-world objects can have more than one model that represents it.  This means that some models are better than others.

 

taking it out for a spin

Here are screenshots from my iOS simulator of the sort of thing you can do in your app using Swift in Xcode 9:

Yup, my little nephew is happy here.

No no, we weren't fighting. We were actually trying to take a selfie at the Rose Garden in Portland and I think Rose was fixing her hair. But I agree with the model: exposing teeth as a friendly gesture is definitely a strange social practice; I mean, teeth are for biting.

It failed to detect the two sideways faces when all four of us were in the picture. And why is my nephew sad? (I thought he was calm and content.) Although once I removed the happy guy and the sad guy from the picture, it found the little troublemaker.

Yay, it found my face! I guess it wasn't trained on tiger faces.

It's true. This is what neutral looks like on the Q23 bus on Queens Boulevard. (Body language was outside the bounding box.)

Why a smaller bounding box? And why is this one angry and the other neutral? (Slightly different aspect ratio, which might explain the bounding box; but if so, is he angry because of the bounding box or because of the aspect ratio?)

Aspect ratio was distorted because the image was scaled to fill. So now he is happy.

The images I threw at the emotion model were tough.  As you can see, accuracy matters.  And so does robustness.  With any model, you should generally expect false positives and false negatives; and, while you want the model to draw nuanced distinctions, you don't want it to be finicky and overly sensitive.  Some models will, for each guess, tell you how confident they are, which is helpful.

I suppose if you need to know how confident a model is and it isn't telling you, you could write a testRobustness function that programmatically tweaks the aspect ratio of a picture, stretching it up or sideways a tad, and then iteratively passes each version to the model.  If the model keeps giving you the same answer, the answer is robust and you could say the model is pretty confident.  Otherwise, if you consider slight changes in aspect ratio to be noise (as I think you should), the classifier's (model's) performance is not stable under a bit of noise in the data, at least not for certain types of data.
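Here is a rough sketch of that testRobustness idea.  The classify parameter is a stand-in for whatever synchronous wrapper you have around your model, and the stretch factors are arbitrary.

```swift
import UIKit

// A rough sketch of the testRobustness idea: stretch the image slightly and
// check whether the classifier's answer survives the perturbation.
// The `classify` closure and the stretch factors are hypothetical stand-ins.
func testRobustness(of image: UIImage,
                    using classify: (UIImage) -> String?) -> Bool {
    let originalLabel = classify(image)
    let stretchFactors: [(CGFloat, CGFloat)] = [(1.05, 1.0), (1.0, 1.05), (0.95, 1.0), (1.0, 0.95)]

    for (sx, sy) in stretchFactors {
        let newSize = CGSize(width: image.size.width * sx,
                             height: image.size.height * sy)
        UIGraphicsBeginImageContextWithOptions(newSize, false, image.scale)
        image.draw(in: CGRect(origin: .zero, size: newSize))
        let stretched = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()

        // If a slight stretch flips the label, the answer is not robust.
        if let stretched = stretched, classify(stretched) != originalLabel {
            return false
        }
    }
    return true
}
```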

 

Combining models to work together

What you see in the screenshots above is two ML models working in tandem.  The two models are chained together, with the output from the first model fed as input into the second.  The first model is the one that finds a face in a picture and gives you a bounding box.  You then take the bounding box and crop the image, so that it contains just a face and only one face.  You then give the new cropped image to the second model, which is the one that classifies faces as happy, sad, neutral, or angry.
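Here is a sketch of that chaining, using Vision's built-in face detector (VNDetectFaceRectanglesRequest) as the first model and the hypothetical classifyEmotion function from earlier as the second.  The coordinate conversion is there because Vision returns normalized bounding boxes with the origin at the bottom-left.

```swift
import UIKit
import Vision

// A sketch of chaining the two models: the built-in face detector produces
// bounding boxes, each face is cropped out of the image, and the crop is fed
// to the (hypothetical) emotion classifier from the earlier sketch.
func detectFacesAndClassify(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let faceRequest = VNDetectFaceRectanglesRequest { request, _ in
        let faces = request.results as? [VNFaceObservation] ?? []
        for face in faces {
            // Convert the normalized, bottom-left-origin bounding box
            // into pixel coordinates for cropping.
            let width = CGFloat(cgImage.width)
            let height = CGFloat(cgImage.height)
            let box = face.boundingBox
            let rect = CGRect(x: box.origin.x * width,
                              y: (1 - box.origin.y - box.height) * height,
                              width: box.width * width,
                              height: box.height * height)
            if let faceCrop = cgImage.cropping(to: rect) {
                classifyEmotion(in: UIImage(cgImage: faceCrop)) { label in
                    print("Face at \(rect): \(label ?? "unknown")")
                }
            }
        }
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([faceRequest])
}
```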

If you are not convinced that the built-in bounding-box finder uses a model, then remember that (a) it detects faces (so it had to be trained to do that, likely using a deep neural network with some convolutional layers and some fully connected layers), and (b) you could probably import a third-party model to give you bounding boxes instead of using the built-in one if, for example, Apple's face detector does not work well on your data (say, if you need bounding boxes around tiger faces).  As for the emotion model, if you were training one yourself, then perhaps you could make the detect-face-then-draw-bounding-box-and-then-crop bit a preprocessing step built into the black box that houses your model.  Maybe you could even add a spatial transformer to your emotion classifier.

But back from the clouds.  It is important to remember that we are evaluating two models working in tandem: we might be judging visually using cues from outside the bounding box (are we?), whereas all the emotion model gets as input is what's inside the bounding box.

This is what we see.

Because my nephew was sad (see pic above), I reran the classifier on a zoomed-in version of that photo.

A slightly different aspect ratio. Unclear if zoom made a difference. (I didn't run controlled experiments or anything.)

I was actually trying to look concerned! But it's true: I was happy inside.

This is what the emotion classifier sees, minus its labels:

(The three cropped face images.)

Reading emotions is difficult business.  We often rely on context.  Body language, tone of voice, a backstory.  But it's true that I was happy that day.  And perhaps my nephew held all those emotions too—who is to say he wasn't a little bit of each: angry, happy and sad.

 

Where to get a trained model

Apple answers this question with an intro page called "Getting a Core ML Model."  It's a great page to visit and proceed from, but it looks like it was written so as not to overwhelm or intimidate.  So, to zoom out a little, your options are:

Option 1.  Bounding boxes and other "basic" built-in functionality aside, Apple also gives you six trained models at the bottom of their machine-learning landing page.  Each model is packaged as an .mlmodel file.  You could take one of those.  (I think they might all be academic open-source models; I don't think Apple trained or retrained any of them.)

Option 2.  You can also use coremltools, a Python utility, to convert from a Caffe model (.caffemodel) to a Core ML model (.mlmodel).  This is probably what Apple did anyway to give you the six models in Option 1.  Or you could write your own conversion tool.  So if you find a trained Caffe model you like, you could convert it.  The emotion classifier above is an example of that.

Option 3.  You could subscribe to an API like Google's Vision API or Amazon's Rekognition API and upload the data you want to classify using that API.  For example, a user takes a picture with her smartphone and your app uploads this picture to a Google or Amazon server by POSTing or PUTting to the API's endpoint.  In other words, you could get a trained model by paying someone who is monetizing their trained models by offering them as a service in the cloud, which would mean performing classification off the mobile device (so not natively).

Option 4.  You could also train your own model, at home on your own hardware (probably not your laptop) or on a cluster of EC2 instances up on AWS.  GPUs are good to utilize.  Obviously do not do any serious training on a mobile device; you'll probably need more juice than that.  Once you've trained it, you have two options: (a) you could convert it to an .mlmodel file and put it on the mobile device, or (b) you could house it in your own data center (if you are a big company that has one) or in the cloud (e.g. on AWS) and serve it to your app through a REST API, just like Google and Amazon do.

 

Why perform classification (or regression) on the MOBILE device and not in the cloud

Assuming your users' mobile hardware is fast enough...  

Doing classification off the device and in the cloud (e.g. by subscribing to someone else's classification API) requires the least amount of coding.  You just upload the image (or other data) to the cloud and get back the results.  If you know how to make a POST HTTP request, you know how to use an ML API.
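To illustrate just how little code that is, here is a bare-bones sketch of POSTing an image with URLSession.  The endpoint, header, and response handling are placeholders; a real service like Google's Vision API or Amazon Rekognition has its own request schema and authentication.

```swift
import UIKit

// A bare-bones sketch of cloud classification: POST the image to a
// classification API and print whatever comes back. The URL and headers
// are placeholders, not a real service's schema.
func uploadForClassification(_ image: UIImage) {
    guard let url = URL(string: "https://api.example.com/v1/classify"),
          let imageData = UIImageJPEGRepresentation(image, 0.8) else { return }

    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("image/jpeg", forHTTPHeaderField: "Content-Type")
    request.httpBody = imageData

    URLSession.shared.dataTask(with: request) { data, _, error in
        if let data = data, error == nil {
            // A real service would return JSON, e.g. labels with confidence scores.
            print(String(data: data, encoding: .utf8) ?? "")
        }
    }.resume()
}
```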

However, the amount of coding on the device is now pretty minimal too, thanks to Apple packaging it all up for us.  Moreover, doing classification off the device has three disadvantages:

  1. It is difficult, if not impossible, to use with video if you want to do classification in real time. Too many images, too fast.

  2. It obviously also would not work if you are offline, because how would you then upload your data (e.g. an image)? Poor connectivity: same problem. And uploading images in batch once back online might be prohibitive, and it would not be real time either.

  3. If you have privacy concerns (moral concerns) or privacy-compliance requirements (legal constraints) and want to keep the data on the device, then shipping that data off to someone else's server is a non-starter.

But if the kind of classification you need still takes too long on the device, if your model file is too big, or if the data you need comes from multiple devices, and you don't mind the subscription fee (or can and want to build your own classification API), then maybe classification in the cloud is for you.

 

other considerations

Release cycle and deployment: what if you expect to get a better model after you release your app?  This is certainly true for Amazon's Alexa, the little black tower you are supposed to talk to that turns an angry red when you tell it not to eavesdrop by muting its mic.  Alexa's speech-recognition model is likely still in training.  It is already a model that can do some stuff, but it is also improving.  If the training is happening elsewhere—not on the device and not on data local to the device—but you want to do classification on the device with an .mlmodel file, then you would have to do another release of your app.  That's certainly an option.  Or, instead, you could have the app download the new model using URLSession or CloudKit and then compile it on the device using the compileModel(at:) class method on MLModel (part of the Core ML framework).
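Here is a sketch of that download-then-compile flow.  The remote URL is a placeholder, and a real app would move the compiled model somewhere permanent instead of leaving everything in the temporary directory.

```swift
import CoreML
import Foundation

// A sketch of shipping an updated model without an App Store release:
// download the new .mlmodel file, compile it on the device, and load it.
func updateModel(from remoteURL: URL, completion: @escaping (MLModel?) -> Void) {
    URLSession.shared.downloadTask(with: remoteURL) { tempURL, _, error in
        guard let tempURL = tempURL, error == nil else {
            completion(nil)
            return
        }
        do {
            // Give the downloaded file an .mlmodel extension before compiling.
            let fileManager = FileManager.default
            let modelURL = fileManager.temporaryDirectory.appendingPathComponent("Updated.mlmodel")
            try? fileManager.removeItem(at: modelURL)
            try fileManager.moveItem(at: tempURL, to: modelURL)

            // Compiling produces an .mlmodelc directory that MLModel can load.
            let compiledURL = try MLModel.compileModel(at: modelURL)
            let model = try MLModel(contentsOf: compiledURL)
            completion(model)
        } catch {
            completion(nil)
        }
    }.resume()
}
```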

Does Core ML only support convolutional neural networks?  Short answer: no.  Convolutional neural networks are all the rage because they do a good job with computer-vision recognition tasks, and they are a staple of deep learning.  ("Deep" refers to the depth of the neural net; as in, a bunch of layers between the input layer and the output layer.)  But Core ML supports other types of ML models too, including random decision forests and Vapnik's support vector machines.  Not sure where Bayesian ML fits in, though.

Is the model classifying or predicting?  In machine learning, the line between classifying (or regressing) and predicting is blurry.  To an ML model, classifying and predicting are really the same thing.  The big difference between classifying and predicting is time: whether you are dealing with properties and identity or, instead, with identity and future events.  But that's not really part of the model.  Or so goes my impression of ML, but I am no expert.  (And sure, time can be a feature in the model, but I don't see how that's enough for a real conceptual distinction.)  And of course there is also this crazy sort of thing that you could do with a neural net.

 

Takeaway

Machine-learning (ML) models are becoming consumer-grade tech, accessible to a software developer rather than only to a computer scientist or someone else with a computational-math background.  Ten years ago, this was all possible (and Facebook has been doing something like this for a while with their tag-a-friend feature), but it would've taken me a lot longer to implement.  Today, thanks to a lot of academic work, trained open-source models are packageable and shareable.  And thanks to Apple, it's now easy to programmatically use them on your iPhone.  With iOS 11 out, expect to see a lot more apps that utilize ML models!  (Same goes for all the folks trying to monetize trained models by exposing them through REST APIs, which is also an opportunity for them to sell their cloud compute at a markup.)  And also expect a lot more people (i.e. iOS devs) tomorrow saying that they know ML when they don't really; so, my machine-learning friends, please have patience with the rest of us.

Oh, and I forgot to mention: Apple added extra support for computer-vision tasks (the Vision framework) and for natural-language-processing (NLP) tasks, plus something called GameplayKit, all of which sit on top of the Core ML framework.  So vision and NLP, and whatever GameplayKit does, will be the low-hanging fruit.  But in principle you can do other types of tasks too; it might just be a little more work.
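As a taste of the NLP side, here is a tiny sketch that tags each word in a sentence with its part of speech using NSLinguisticTagger's unit-based API from iOS 11; no model file of your own required.

```swift
import Foundation

// A tiny sketch of the built-in NLP support: tag each word in a sentence
// with its part of speech (noun, verb, adjective, ...).
func partsOfSpeech(in text: String) {
    let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
    tagger.string = text

    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]

    tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange, _ in
        if let tag = tag {
            let word = (text as NSString).substring(with: tokenRange)
            print("\(word): \(tag.rawValue)")  // e.g. "happy: Adjective"
        }
    }
}

// Example: partsOfSpeech(in: "My little nephew is happy")
```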

 

All photographs in this article are used with the consent and permission of everyone pictured.
