This section covers the last step in the data chain. After data analysis and modeling, we need to make the resulting model available to end users. To achieve this, we deploy the model into an application, and sometimes we also need to present its results in a more visual way.

Model inference, deployment and compression

As mentioned before, modeling involves two stages: model training and inference. Inference means using a trained model ($y=f(x)$) to make a prediction ($y^*$) for a new input ($x^*$). Applying a model is mainly about model inference. We often say that model training is “offline”: it can take a long time and happens before the application runs. Model inference is “online”: the model must make predictions with low latency.
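To make the offline/online split concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional linear model fitted by least squares. The training data and the function names `train` and `predict` are illustrative only and do not come from any library.

```python
# Offline stage: fit f(x) = w * x + b by ordinary least squares.
def train(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Online stage: evaluate the already trained model on a new input x*.
def predict(params, x_new):
    w, b = params
    return w * x_new + b


# Illustrative training data generated from y = 2x + 1.
params = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Inference only needs the stored parameters, not the training data.
y_star = predict(params, 4.0)
print(y_star)
```

Training touches the whole dataset once, while each call to `predict` is a single cheap evaluation, which is why only the second stage has to meet latency requirements.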

To use the model, we need to deploy it to an application.

For a web application, you can simply load the model file and use it for prediction. Below is an example using Flask:

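import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

MODEL_PATH = 'model.pt'        # placeholder: a model saved with torch.save(model, MODEL_PATH)
CLASS_NAMES = ['cat', 'dog']   # placeholder: one label per model output index

app = Flask(__name__)
# weights_only=False is needed on recent PyTorch versions to unpickle a full model object
model = torch.load(MODEL_PATH, weights_only=False)
model.eval()

# preprocessing must match what the model was trained with; these values are an assumption
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def get_prediction(image_bytes):
    # decode the uploaded bytes and build a batch containing one image
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        outputs = model(batch)
    _, y_hat = outputs.max(1)
    class_id = y_hat.item()
    return class_id, CLASS_NAMES[class_id]

@app.route('/predict', methods=['POST'])
def predict():
    # the route only accepts POST, so no method check is needed
    file = request.files['file']
    # convert the uploaded file to bytes
    img_bytes = file.read()
    class_id, class_name = get_prediction(image_bytes=img_bytes)
    return jsonify({'class_id': class_id, 'class_name': class_name})

if __name__ == '__main__':
    app.run()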

You can also use a more advanced deployment approach, such as TF Serving (see the tutorial in “Further Reading”).

For mobile apps, you can deploy the model to a web server and call it from the app. This method suits huge models, which are impractical to deploy on mobile devices with limited storage and computation. You can also deploy the model on the mobile device itself. Modern mobile devices (and many other edge devices) have machine learning inference capability (and even some training capability), and some of them are equipped with dedicated chips (an SoC or a neural processing unit, NPU). Popular frameworks include PyTorch Mobile and TensorFlow Lite.

We can deploy the trained model directly for inference, but the model is typically huge, which is a problem for deployment environments with limited storage, power, and computation. Therefore, we normally need to compress the model to a more compact size. There are many methods for model compression, including pruning [1, 2], vector quantization [3], distillation [4, 5], etc.
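As a rough illustration of two of these ideas, the sketch below applies magnitude pruning and then 8-bit quantization to a flat list of weights in plain Python. It is a simplification under stated assumptions: the weights are made up, the function names are hypothetical, and the quantizer is a uniform scalar quantizer, not the vector quantization cited above. Real frameworks provide their own, more careful implementations.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    # indices of the k smallest-magnitude weights
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_uint8(weights):
    """Map floats to integer codes in 0..255 plus (scale, minimum) for decoding."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]


# Illustrative weights, not taken from a real model.
weights = [0.8, -0.05, 0.3, -0.9, 0.02, 0.6]

pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale, lo = quantize_uint8(pruned)
restored = dequantize(codes, scale, lo)

print(pruned)
print(codes)
```

Pruning saves space only if the zeros are stored in a sparse format, and quantization trades a small reconstruction error (at most about half of `scale` per weight here) for storing one byte per weight instead of a 32-bit float.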

Visualization

The analytic results and model output are sometimes complex and difficult for end users to observe and understand directly. In this case, (data) visualization can help convey the knowledge by transforming the complex data into easy-to-understand visual components and hints.

Visualization can be very simple (e.g. a line chart) or very complex (e.g. a VR animation). The principle is not about making it fancy, but about identifying the key information you want to present and designing the most suitable type of visualization for it.

For example, below is a visualization of the London Underground map. It precisely presents the geographic routes of the underground lines. But it is not a good visualization, since it is hard for a passenger to figure out how to transfer between lines.

(Image source: https://deskarati.com/wp-content/uploads/2012/03/geo_tubemap.gif)

Therefore, a more commonly used map looks like this:

(Image source: https://i.pinimg.com/originals/66/aa/37/66aa37b0010320b56235876828f39246.jpg)

If you want to observe the number of passengers, you can use a visualization like this: