Nicholas R. Johansen – nicjo16@student.sdu.dk

Mattias Damsgaard – mdams16@student.sdu.dk

Introduction

The idea is to recognize objects in real time using a model and the Vision framework from Apple. If possible, an additional face detection feature will also be implemented. The application thus falls under the subject of machine learning, which currently has a huge range of applications; one example is the drone industry, which is trying to make its drones fully autonomous for power line inspections.

Naturally, with such a complex topic, the knowledge required to grasp it in its entirety is tremendous. This application is therefore merely a simplified version of the ideal solution: an application that could be tailored to the specific needs of a company. To understand the subject, one would at the very least need a basic understanding of machine learning, face recognition and object recognition.

Our objectives are as follows:

  • Keep the app simple to understand, as it resembles well-known elements from iOS
  • Make use of object recognition
  • Possibly make use of face recognition
  • Draw as many relevant subjects from the lectures into the application as possible

Methods and Materials

Conceptual design

The brainstorming phase was skipped over rather quickly, as the idea was already formed beforehand, so the conceptual design and simple prototyping phase began almost immediately. The conceptual design would be used to show the users how the app would work, but also to give us an idea of how we should develop the application.

Requirements and MoSCoW

Requirements were needed in order to know what the app should be able to do and what was most important when developing it. They were also useful during the evaluations, as the users could say what they thought was most important for the app to be as intuitive as possible. Since some things might give us trouble and time was a factor, we also used MoSCoW to prioritize the features of the app.

Use-case diagrams

The use-case diagrams were used to identify what features the app should have and how exactly they should work. This gave us an overview of what exactly needed to be developed.

Evaluation and prototyping

Multiple user tests and demonstrations were conducted to get feedback from users. This was done to ensure that the app was easy to use for people who had never tried it before. This led us to develop a prototype and continue to evolve it from evaluation to evaluation.

Implementation

The Core ML model, ResNet50, is able to detect the dominant objects in the current camera image and categorize them into one of its 1000 available categories. This model is what the whole application is built around.

https://developer.apple.com/machine-learning/build-run-models/
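As a minimal sketch of how the model can be queried, the snippet below classifies a single UIImage through Vision. It assumes the Resnet50.mlmodel file from Apple's model page linked above has been added to the Xcode project, which generates the Resnet50 class; the function name and completion handler are our own.

import UIKit
import Vision
import CoreML

// Sketch: classify a single UIImage with ResNet50 through Vision.
// Assumes Resnet50.mlmodel has been added to the project so Xcode
// generates the Resnet50 class.
func classify(_ image: UIImage, completion: @escaping (String, Float) -> Void) {
    guard let cgImage = image.cgImage,
          let model = try? VNCoreMLModel(for: Resnet50().model) else { return }

    let request = VNCoreMLRequest(model: model) { request, _ in
        // The results are the model's categories, sorted by confidence.
        guard let results = request.results as? [VNClassificationObservation],
              let best = results.first else { return }
        completion(best.identifier, best.confidence)
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}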

Results

Here the results of some of the methods are presented.

A demo of the app can be seen here: https://youtu.be/xT7NAsUK97g

The code of the app can be found here: https://bitbucket.org/iosprogrammingsdu/objectrecognizer/src/master/

Conceptual design

The conceptual design gave us an idea of how the app should look. It was all about figuring out how to access the different parts of the app. A lot of work went into thinking about how the whole app should look and be designed, and since we wanted the app to look as close as possible to the stock iOS camera app, we had to keep this in mind as well.

We decided to make a history button that would lead the user to a new view where other features could be implemented, such as the face scan feature, but also anything else we might want to add to the app in the future. We tried out different designs to follow the iOS design aesthetics as closely as possible. We ended up with the following:

The left-hand picture shows what the camera sees. When the app detects something it recognizes, it shows a bar at the bottom with information about it, as can be seen in the middle picture. The history menu is shown on the right-hand side. This is the view where further features could be implemented.

Requirements and MoSCoW

To further help build up the app, some requirements were identified during the project.

Learnability

Since the app strongly resembles the camera app built into iOS, it should be very easy to use the first time.

Efficiency

The core functionality of the app does not require any user input aside from launching the app itself, which makes it extremely easy to use. Possible extra features are no more than a few taps away, again keeping the app very easy to use.

Memorability

With its strong resemblance to the camera app, we hope users will remember it as a “camera” clone, and it should therefore not be “forgettable”.

Errors

The core functionality of the app cannot produce any user errors, as it does not explicitly require user input, but the extra functionality of taking a photo and scanning it for faces is another story. The user might, for example, take a photo that does not contain any faces. In other words, some errors might occur, but they are unlikely to be user-generated.
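If the face scan feature makes it into a later build, the “no faces” case could be surfaced to the user rather than treated as a failure. Below is a sketch using Vision's face detection; the function name and completion handler are our own.

import UIKit
import Vision

// Sketch: detect faces in a taken photo and report how many were found.
// Zero faces is not a user error, just an empty result, so the UI can
// show "No faces found" instead of failing.
func scanForFaces(in image: UIImage, completion: @escaping (Int) -> Void) {
    guard let cgImage = image.cgImage else { return }

    let request = VNDetectFaceRectanglesRequest { request, _ in
        let faces = request.results as? [VNFaceObservation] ?? []
        completion(faces.count)
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}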

Satisfaction

We expect the satisfaction level to be high, as the app strongly resembles a well-known part of iOS.

Functional requirements

  • Object recognition
  • Show results

Non-functional requirements

  • Usability
  • Reliability
  • Performance

The performance requirement led us to think about when we should turn off the camera and the scanning feature of the app. As of right now they keep running even when you enter the history menu, and thus use extra power for nothing. This should be changed in later iterations.
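A minimal sketch of how this could look in a later iteration, assuming the capture session is kept as a property on the scanner view controller instead of as a local variable in viewDidLoad() (the property name captureSession is an assumption):

// Sketch: pause the capture session while the scanner view is not visible.
// Assumes `captureSession` is stored as a property on the view controller.
override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    if captureSession.isRunning {
        captureSession.stopRunning()
    }
}

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    if !captureSession.isRunning {
        captureSession.startRunning()
    }
}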

MoSCoW

Must: Recognize objects and display results

Should: Add camera, library and face recognition

Could: Categories for photos based on the recognized object.

Use-case diagrams

The use-case diagrams gave us a quick overview of the features the app should have and also told us which actors should be involved where.

Evaluation and prototyping

After the conceptual design was made, the requirements were found and the use-case diagram was drawn, we evaluated the whole design by asking people in the class. People thought the design was very similar to that of the stock iOS camera app, liked it, and thought it would be easy to use. So far so good.

We started making the first prototype. It was quickly discovered that there was no need for the left picture from the conceptual design, as the scanner would always capture something. Our first prototype simply had the camera on, showing some text at the bottom with the result. We implemented a button in the top right corner for the history. We got this prototype evaluated in the class again, and once again people thought it was nice and easy to use, but several people did not like the placement of the menu button in the top right corner.

We evolved our prototype and decided to place the menu button in the bottom left corner. This way we could save some room at the top of the screen, and it would also make the app look even more like the stock iOS camera app, which has a button for its history in the same place. We polished the scanner view so it looked even nicer and started making the views for the history and the face scan feature.

We got the prototype evaluated in the class again, and people were very positive about the whole scanner, but some would have liked to be able to choose whether the photo that was taken would be saved or not. The feature was not implemented yet, but we explained that you would just have to hit the shutter button and the picture would be saved.

We worked on the prototype again and decided to add an accept view: when you take a picture, it is shown, and you get the option to either cancel or save it. This seemed like a good way of doing it. We had a lot of trouble saving the scanned data onto the picture but eventually got it to work. We were not able to save the pictures in the history view, so as a workaround we decided to save them to the phone's photo library instead. Not optimal, but a working solution at least. The face scan feature gave us huge trouble, and as time was running out, it did not make it into the latest version of the app.
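The workaround of saving to the phone comes down to a single UIKit call; the sketch below shows what the save action in the accept view might look like (the action name and the takenPhoto property are assumptions):

// Sketch: save the annotated photo to the device's photo library.
// Assumes `takenPhoto` holds the image handed over from the scanner view.
// Requires the NSPhotoLibraryAddUsageDescription key in Info.plist.
@IBAction func savePhotoPressed(_ sender: UIButton) {
    guard let photo = takenPhoto else { return }
    UIImageWriteToSavedPhotosAlbum(photo, nil, nil, nil)
    dismiss(animated: true, completion: nil)
}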

We decided to test the app with some friends. They were positive about the design overall. They would have liked the photos to be saved inside the app instead of in the actual photo library, and they were not that impressed with the results of the scanner, but that comes down to the model used.

Implementation

To get the camera running, we use an AVCaptureSession() in viewDidLoad(). We set it to use the photo preset, get the default video device for the input and start the session. We add a preview layer for the session to the view so the camera feed is shown, and we also add the video data output to the session. All of this can be seen here:

override func viewDidLoad() {
    super.viewDidLoad()

    // Set up the capture session with the photo preset.
    let captureSession = AVCaptureSession()
    captureSession.sessionPreset = AVCaptureSession.Preset.photo

    // Use the default video capture device as input.
    guard let captureDevice = AVCaptureDevice.default(for: .video) else { return }
    guard let input = try? AVCaptureDeviceInput(device: captureDevice) else { return }
    captureSession.addInput(input)

    captureSession.startRunning()

    // Show the camera feed by adding a preview layer to the view.
    let previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    view.layer.addSublayer(previewLayer)
    previewLayer.frame = showCameraView.frame

    // Deliver video frames to captureOutput(_:didOutput:from:) on a background queue.
    let dataOutput = AVCaptureVideoDataOutput()
    dataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
    captureSession.addOutput(dataOutput)
}

We use the ResNet50 model as can be seen below. The firstObservation is used to get the identifier of the recognized object and the confidence level. These values are assigned to labels in the view, showing the user what the scanned object is and how certain the model is. The photoInformation is used later to save the information onto the picture taken.

// Get the current camera frame as a pixel buffer.
guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
// Wrap the Core ML ResNet50 model for use with Vision.
guard let model = try? VNCoreMLModel(for: Resnet50().model) else { return }
let request = VNCoreMLRequest(model: model) { (finishedReq, err) in

    guard let results = finishedReq.results as? [VNClassificationObservation] else { return }
    guard let firstObservation = results.first else { return }

    print(firstObservation.identifier, firstObservation.confidence)

    // Update the labels on the main queue and remember the result for later.
    DispatchQueue.main.async {
        self.scannedResultLabel.text = "\(firstObservation.identifier)"
        self.chanceResultLabel.text = "\(firstObservation.confidence * 100)"
        self.photoInformation = "\(firstObservation.identifier), \(firstObservation.confidence * 100)"
    }
}

// Run the classification request on the current frame.
try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])

When the shutter button is pressed, the only thing the action does is set the boolean takePhoto to true. The captureOutput method, which runs for every frame, starts by checking whether this boolean is true. If it is, it sets the boolean back to false, gets the image currently showing, changes the view and shows the image. It also adds the text to the picture with the photoText method, which is shown later. The whole thing can be seen below:

if takePhoto {
    takePhoto = false

    // Turn the current frame into a UIImage.
    if let image = self.getImageFromSampleBuffer(buffer: sampleBuffer) {
        let savePhotoViewController = UIStoryboard(name: "Main", bundle: nil).instantiateViewController(withIdentifier: "SavePhotoViewController") as! SavePhotoViewController

        // Draw the scanned result on top of the photo before handing it over.
        let annotatedImage = photoText(WriteInfo: photoInformation! as NSString, Image: image, Point: CGPoint(x: (image.size.width/2-(textWidth/2)), y: 20.0))
        savePhotoViewController.takenPhoto = annotatedImage

        // Present the accept view on the main queue.
        DispatchQueue.main.async {
            self.present(savePhotoViewController, animated: true, completion: nil)
        }
    }
}
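For completeness, the shutter action itself is trivial; it could look like the following (the action name is an assumption):

// Sketch: the shutter button only raises a flag; the running
// captureOutput method picks it up on the next frame.
@IBAction func shutterButtonPressed(_ sender: UIButton) {
    takePhoto = true
}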

Saving the scanned data onto the picture is done with the method below. We start by defining some values and attributes for the text. We create a CGRect and draw the text passed in as writeInfo inside it. The textWidth is a global variable that is used where photoText is called, so the text will be horizontally centered in the picture.

func photoText(WriteInfo writeInfo: NSString, Image image: UIImage, Point point: CGPoint) -> UIImage {
    let textFont = UIFont(name: "Helvetica", size: 20)
    let screenScale = UIScreen.main.scale

    // Start an image context the same size as the photo.
    UIGraphicsBeginImageContextWithOptions(image.size, false, screenScale)

    // White text on a black background so it is readable on any photo.
    let attributes = [
        NSAttributedString.Key.font: textFont as Any,
        NSAttributedString.Key.foregroundColor: UIColor.white,
        NSAttributedString.Key.backgroundColor: UIColor.black
        ] as [NSAttributedString.Key : Any]

    // Draw the original photo first, then the text on top of it.
    image.draw(in: CGRect(origin: CGPoint.zero, size: image.size))

    let position = CGRect(origin: point, size: image.size)
    textWidth = position.width
    writeInfo.draw(in: position, withAttributes: attributes)

    // Grab the combined image and close the context again.
    let infoImage = UIGraphicsGetImageFromCurrentImageContext()
    UIGraphicsEndImageContext()

    return infoImage!
}

To explain the object recognition part of the application, an object diagram is shown below:

The camera is opened, as shown at the top level of the diagram. At the second level, the camera may identify many objects, each with a certain level of confidence that what it sees is what the model thinks it is. As an example, it might see a Mac, and the program might be 81% sure that it actually is a Mac. On the bottom row, a categorization takes place; continuing the example, a Mac is part of the category Computers or Laptops. This information should then be stored in the library of the app if the user chooses to save it. To make this happen, the Vision framework is used. Thus, the main part of the project is the Vision framework, which can be imported directly since it was introduced together with iOS 11 and Swift 4. For the object recognition, a model from Apple is used, which works together with Vision. The ResNet50 model is essentially “just” a pre-trained network and runs through Core ML.
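The categorization step on the bottom row is not in the current build, but it could start out as a simple lookup from recognized identifiers to the app's own categories. Below is a sketch; the table entries are illustrative examples rather than the real model labels.

// Sketch: map recognized identifiers to app-level categories.
// The entries are illustrative examples only.
let categoryTable: [String: String] = [
    "laptop": "Computers",
    "notebook": "Computers",
    "desktop computer": "Computers",
    "coffee mug": "Kitchen"
]

func category(for identifier: String) -> String {
    return categoryTable[identifier.lowercased()] ?? "Other"
}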

Discussion

The original idea of just being an object recognizer was accomplished. The face scanner proved more difficult than first anticipated, and though we had the face scan code ready, we just could not find a way to incorporate it so that it worked. Because of this, we were not able to get it into the current build of the software. Also, the history button in the scanner view should have been made like the one in the iOS camera app, where the last taken picture is shown as the button. We were able to save photos with the scanned result, but not in the app directly, as was the original idea.

All of our evaluations have been very positive. Generally, people think our app is very easy to use. Of course, there is not much to do in it, but the resemblance to the iOS camera app also helps a lot, as people are already familiar with using it.

Good things about the Object Recognizer are its ease of use and its familiar user interface. The implemented model does not recognize much, though, and apart from very familiar items it will very often fail. This could be fixed fairly easily by swapping in another, more advanced framework or model.
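Swapping the model would be a small change, since Vision wraps any Core ML image classifier: only the model construction changes, while the request and result handling stay the same. A sketch, assuming a different .mlmodel file such as Apple's MobileNetV2 has been added to the project (which would generate a MobileNetV2 class):

import Vision

// Sketch: building the classification request with another model.
// Assumes MobileNetV2.mlmodel has been added to the project.
func makeClassificationRequest(completion: @escaping (VNRequest, Error?) -> Void) -> VNCoreMLRequest? {
    guard let model = try? VNCoreMLModel(for: MobileNetV2().model) else { return nil }
    return VNCoreMLRequest(model: model, completionHandler: completion)
}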

As the framework and model are not that powerful, our app is sadly not very competitive compared to its rivals.

Things that need to be done in the future would be implementing the face scanner feature, making the history button show the last taken photo, implementing a better model for recognizing objects and making it possible to save photos of scanned objects inside the app itself. That would at least complete the basic version of the object recognizer.

All in all, the project was a success. We gained a lot of knowledge working on the app and ended up with a simple but intuitive app. This was made possible by learning much more about how iOS works and about the design aesthetics of the platform. Though not very advanced, having a working app that can recognize things is pretty cool.
