
Development Flow

Note

This document has been machine translated.

From here, we explain the module development flow using the scenario of "modularizing PyTorch's Basic MNIST Example". MNIST is an image dataset of handwritten digits 0-9. The Basic MNIST Example (see link below) is a program that trains on this dataset and predicts the digit drawn in an image.

  • https://github.com/pytorch/examples/tree/main/mnist

Please refer to this procedure as a base for developing the required functionality as a module.
A complete set of examples including this document can be downloaded from here.

Specification study and external design

1. Decide on a module name

Decide on a module name that does not overlap with other modules. In this example, we use mnist_module.

2. Decide on the method structure

Decide which methods you want the module to have. In this example, we want the module to perform training and prediction processing, so we will use the following two methods.

  • train: Train the model
  • predict: Predict using the trained model

3. Determine the input/output of the methods

Determine the input/output for each method. Considering the contents of MNIST's training and prediction processes, the inputs and outputs are as follows.

  • train
    • input: Labeled image data (for training)
    • output: Trained model (CNN)
  • predict
    • input: Labeled image data (for prediction)
    • input: Trained model (CNN)
    • output: Prediction result data

4. Determine the schema for input/output data

The inputs and outputs of the module's methods are stored in tables or in the model store in EvWH. Of the two, a schema must be determined for the data stored in tables; for data stored in the model store (trained models), no schema needs to be considered.

  • Labeled image data
    • image_id integer
    • label integer
    • image text (base64 encoded string of image binary data)
  • Prediction result data
    • image_id integer
    • label integer
  • Trained model (CNN)
    • (no schema review required)
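Since the image column stores image binary data as a base64-encoded string, the conversion in both directions is straightforward in Python. A small sketch (the helper names are illustrative, not part of the module):

```python
import base64

def image_to_text(image_bytes: bytes) -> str:
    # Encode raw image binary data as the base64 string stored in the "image" column.
    return base64.b64encode(image_bytes).decode("ascii")

def text_to_image(image_text: str) -> bytes:
    # Decode the base64 string from the table back into image binary data.
    return base64.b64decode(image_text)
```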

Development Process

1. Create the directory structure

First, create the MNIST module directory directly under the module root directory. In this example, the structure is as follows. The files do not need to be created at this stage; their contents are explained in the next section.

- mnist_module/
  - docker-compose.yml
  - Dockerfile_train
  - Dockerfile_predict
  - Dockerfile_get-definition
  - src/
    - train.sh
    - predict.sh
    - definitions/
      - definition.json
    - mnist/
      - train.py
      - predict.py
      - requirements.txt
      - ...

2. Create file templates along with the method structure

First, as a template, create a minimum configuration in which the train and predict methods, which do nothing yet, run via the Dataproc API.

Creating the minimum configuration

The minimum configuration consists of:

  • docker-compose.yml, the Dockerfiles, and the entry points (train.sh, predict.sh) required for the docker compose command to work
  • the definition.json required for the module to work with the Dataproc API

docker-compose.yml (completed)

services:
  train:
    build:
      context: .
      dockerfile: Dockerfile_train
    entrypoint: ["/work/src/train.sh"]

  predict:
    build:
      context: .
      dockerfile: Dockerfile_predict
    entrypoint: ["/work/src/predict.sh"]

  get-definition:
    build:
      context: .
      dockerfile: Dockerfile_get-definition
    entrypoint: ["cat", "/work/src/definitions/definition.json"]

  • get-definition is a method (service) that every module running on the Dataproc API must include, in addition to the methods being implemented.

Dockerfile_train (incomplete)

FROM python:3

WORKDIR /work
COPY ./src /work/src

RUN chmod +x /work/src/train.sh

Dockerfile_predict (incomplete)

FROM python:3

WORKDIR /work
COPY ./src /work/src

RUN chmod +x /work/src/predict.sh

Dockerfile_get-definition (completed)

FROM alpine:latest

WORKDIR /work
COPY ./src /work/src

train.sh, predict.sh (incomplete)

#!/bin/sh

definition.json (incomplete, input/output not defined)

{
  "name": "module.mnist_module",
  "description": "MNIST module",
  "supports": "",
  "definitions": {},
  "scripts": {
    "train": {
      "description": "Train",
      "path": "train",
      "type": "training",
      "input": {},
      "output": {}
    },
    "predict": {
      "description": "Predict",
      "path": "predict",
      "type": "prediction",
      "input": {},
      "output": {}
    }
  }
}

Operation check

Next, as an operation check, call the module that does nothing via the Dataproc API. Run

docker compose build

and if there are no build errors, run the sample client (mnist_module.zip/samples/sample_client_1.py). If the sample client also completes without error, the template is complete.

3. Prepare input/output for each method

In the previous section, we did not define input/output for the training/prediction processing in definition.json. In this section, we define the input/output so that input data can be obtained from EvWH and output data can be stored in EvWH. The training/prediction processing itself is still left unimplemented.

For the input/output definition, entries must be written in two places in definition.json: definitions and scripts.

Setting definitions

First, in definitions, define the format of input and output data. In this case, we need three types of data: labeled image data (for training and prediction), learned models, and prediction result data.

  • Labeled image data (for training and prediction) "mnist_dataset": Defines the labeled image data. This definition is shared by training and prediction. Data in this format is input/output to/from EvWH tables in CSV format; for this purpose, set "form": "directory" and define a CSV file (dataset.csv) under it. The CSV schema also contains start_datetime, end_datetime, and location, but these are columns created by default in data loaded by the data loader and are not used by MNIST.
  • Trained model "mnist_model": Defines the trained model. Since data in this format is input/output to/from the EvWH model store, it is treated as a set consisting of a model file and its metadata (JSON). The model file is in state_dict format, which can be saved with PyTorch's torch.save(). Set "form": "directory", with the model (mnist.pth) and metadata (mnist.xdata-meta.json) file definitions below it.
  • Prediction result data "mnist_predict_result": Defines the prediction result data. Data in this format is input/output to/from EvWH tables in CSV format; to do so, define it as a CSV file with "form": "file".
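The state_dict save/load mentioned for the trained model follows standard PyTorch usage. A minimal sketch (the Net class here is a simplified stand-in, not the actual MNIST CNN, and the temporary path stands in for /work/runs/output/models/mnist.pth):

```python
import os
import tempfile

import torch
import torch.nn as nn

class Net(nn.Module):
    """Simplified stand-in for the MNIST CNN defined in train.py."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

model = Net()

# Save only the parameters (state_dict format), as expected for mnist.pth.
path = os.path.join(tempfile.gettempdir(), "mnist.pth")
torch.save(model.state_dict(), path)

# To load, instantiate the same architecture and restore the parameters.
restored = Net()
restored.load_state_dict(torch.load(path))
```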

scripts configuration

In scripts, input/output definitions are written for train and predict respectively, with the following contents, according to the method input/output decided in the design phase.

  • train.input
    • Labeled image data (for training) "train_dataset": Sets the input of the labeled image data used for training. paramtype should be "mnist_dataset", defined in definitions. The path specifies that the data acquired from the DDC is pre-populated at /work/runs/input/train_dataset/dataset.csv.
  • train.output
    • Trained model "trained_model": Sets the output of the trained model. paramtype should be "mnist_model", defined in definitions. By specifying path, if the model and its metadata are output to /work/runs/output/models/mnist.pth and /work/runs/output/models/mnist.xdata-meta.json, they are automatically stored in the model store.
  • predict.input
    • Labeled image data (for prediction) "predict_dataset": Sets the input of the labeled image data used for prediction. paramtype should be "mnist_dataset", defined in definitions. For simplicity, the same definition as the training data is used, but the labels are only used to calculate the percentage of correct answers during prediction. The path specifies that the data retrieved from the DDC is pre-populated at /work/runs/input/predict_dataset/dataset.csv.
    • Trained model "trained_model": Sets the input of the trained model. paramtype should be "mnist_model", defined in definitions. The path specifies that the model retrieved from the model store and its metadata are pre-populated at /work/runs/input/models/mnist.pth and /work/runs/input/models/mnist.xdata-meta.json.
  • predict.output
    • Prediction result data "predict_result": Sets the output of the prediction result data. paramtype should be "mnist_predict_result", defined in definitions. By specifying path, if the prediction result data is output to /work/runs/output/predict_result/result.csv, it is automatically stored in the DDC.
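Putting the above together, the filled-in parts of definition.json take roughly the following shape. This is an illustrative sketch only: the "form" and "paramtype" attributes are the ones described above, but the per-file definitions and the "path" settings are omitted, and the exact structure should be taken from the completed definition.json in the attachment (mnist_module.zip/src/definitions/definition.json).

```json
{
  "definitions": {
    "mnist_dataset":        { "form": "directory" },
    "mnist_model":          { "form": "directory" },
    "mnist_predict_result": { "form": "file" }
  },
  "scripts": {
    "train": {
      "input":  { "train_dataset": { "paramtype": "mnist_dataset" } },
      "output": { "trained_model": { "paramtype": "mnist_model" } }
    },
    "predict": {
      "input":  {
        "predict_dataset": { "paramtype": "mnist_dataset" },
        "trained_model":   { "paramtype": "mnist_model" }
      },
      "output": { "predict_result": { "paramtype": "mnist_predict_result" } }
    }
  }
}
```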

The previous steps complete the preparation of the module including input and output. From here, we will check the operation of the created module.

Preparation for operation check: create containers (DDC, model store) for input/output on the EvWH side

The input and output data defined in definition.json are stored in a DDC or the model store, so these must be created in advance for the operation check.

The following curl commands call the relevant APIs to create the model store and the DDC for the prediction result data. The examples below use the APIKey and Secret issued during the initial installation of xData Edge; set them according to your environment.

curl \
-H "Content-type:application/json" \
-H "APIKey: a4e16fab-9077-4a13-a7d5-0f1fda63cd16" \
-H "Secret: Do8feigevo8OhCi7OchoSan8zid7Fie3Aez4" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "fl.initialize_model_store",
  "params": {
    "model_store_ddc":"ddc:mnist_models"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/fl/jsonrpc

curl \
-H "Content-type:application/json" \
-H "APIKey: 0df23f0d-44cb-4085-8ac9-158e6ced3056" \
-H "Secret: 4AzbY3pvfes33aow49nhjn3aUphv9zFWctcT" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "prov.new_ddc",
  "params": {
    "ddc_label": "ddc:mnist_predict_result",
    "columns": [
      {"column_name":"image_id", "data_type":"int"},
      {"column_name":"predicted", "data_type":"int"}
    ],
    "key": "image_id"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/provenance/jsonrpc
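The curl calls above are plain JSON-RPC 2.0 POSTs, so the same requests can also be scripted. A minimal Python sketch using only the standard library (the helper names are illustrative; the APIKey and Secret must match your environment):

```python
import json
import urllib.request

def build_jsonrpc(method, params, request_id="mnist_module"):
    """Assemble a JSON-RPC 2.0 request body like the ones in the curl examples."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}

def call_jsonrpc(url, api_key, secret, method, params):
    """POST the request with the APIKey/Secret headers and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_jsonrpc(method, params)).encode("utf-8"),
        headers={"Content-type": "application/json", "APIKey": api_key, "Secret": secret},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (same call as the first curl command above):
# call_jsonrpc("http://localhost/api/v1/fl/jsonrpc", "<your APIKey>", "<your Secret>",
#              "fl.initialize_model_store", {"model_store_ddc": "ddc:mnist_models"})
```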

Next, create the labeled image data with the attached script (mnist_module.zip/tools/mnist_loader.py), which creates the tables and loads the data:

python mnist_loader.py ~/data_dir --kind train --end 6000
python mnist_loader.py ~/data_dir --kind test --end 1000

Then, link the DDCs to the target tables:

curl \
-H "Content-type:application/json" \
-H "APIKey: 0df23f0d-44cb-4085-8ac9-158e6ced3056" \
-H "Secret: 4AzbY3pvfes33aow49nhjn3aUphv9zFWctcT" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "prov.set_ddc",
  "params": {
    "ddc_label": "ddc:mnist_train",
    "source": "event.mnist_train_tbl_0",
    "ddc_type":"link"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/provenance/jsonrpc

curl \
-H "Content-type:application/json" \
-H "APIKey: 0df23f0d-44cb-4085-8ac9-158e6ced3056" \
-H "Secret: 4AzbY3pvfes33aow49nhjn3aUphv9zFWctcT" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "prov.set_ddc",
  "params": {
    "ddc_label": "ddc:mnist_test",
    "source": "event.mnist_test_tbl_0",
    "ddc_type":"link"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/provenance/jsonrpc

Preparation for operation check: Creating dummy data for output

At this stage, the training/prediction processing has not yet been implemented in either train or predict, so we create dummy output data for the operation check. With the following contents written in train.sh and predict.sh, each script writes out dummy data.

train.sh (incomplete)

#!/bin/sh

mkdir "/work/runs/output/models"

MODEL_FILE="/work/runs/output/models/mnist.pth"
MODEL_META_FILE="/work/runs/output/models/mnist.xdata-meta.json"

echo 'dummy_model_data' > "$MODEL_FILE"
echo '{"model_kind": "mnist", "model_state": 10, "round": 1}' > "$MODEL_META_FILE"

predict.sh (incomplete)

#!/bin/sh

mkdir "/work/runs/output/predict_result"

RESULT_FILE="/work/runs/output/predict_result/result.csv"

# printf keeps the embedded \n portable across shells
printf 'image_id,label\n1,1\n' > "$RESULT_FILE"

Operation check

Now we are ready to check the operation. Run

docker compose build

and if there are no build errors, run the sample client (mnist_module.zip/samples/sample_client_2.py). If the sample client also completes without error, execute the following:

curl \
-H "Content-type:application/json" \
-H "APIKey: a4e16fab-9077-4a13-a7d5-0f1fda63cd16" \
-H "Secret: Do8feigevo8OhCi7OchoSan8zid7Fie3Aez4" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "fl.get_models",
  "params": {
    "model_store_ddc":"ddc:mnist_models"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/fl/jsonrpc

curl \
-H "Content-type:application/json" \
-H "APIKey: 0df23f0d-44cb-4085-8ac9-158e6ced3056" \
-H "Secret: 4AzbY3pvfes33aow49nhjn3aUphv9zFWctcT" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "prov.get_ddc_records",
  "params": {
    "ddc_label": "ddc:mnist_predict_result"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/provenance/jsonrpc

By executing these, you can verify that

  • a model has been added to the model store (output of train)
  • a record has been added to the prediction results (output of predict).

Also, when the module is executed, an execution folder (with a hash as the folder name) is created under the folder set in the WORKER_DATA_DIR environment setting (e.g. ~/work/runs/). Two folders should be generated, one by the training process and one by the prediction process, so make sure the inputs and outputs in those folders are generated correctly. The files

  • [folder where the training process was executed]/input/train_dataset/dataset.csv
  • [folder where the prediction process was executed]/input/predict_dataset/dataset.csv

will be used for the operation check in the next section.

4. Develop the main body of the process

From here, we implement the training and prediction processing for the MNIST dataset. We use the Basic MNIST Example (see link below) available on GitHub as a base, and adapt its input and output formats for the module.

  • https://github.com/pytorch/examples/tree/main/mnist

Implementation of the training and prediction processing (train.py, predict.py)

The following four files are created under src/mnist/ to implement the training and prediction processing.

  • train.py
  • predict.py
  • common.py
  • requirements.txt

Please refer to the attached file (mnist_module.zip/src/mnist/) for the completed code. Only key points are explained here.

  • Training process (train.py)
    • The dataset argument specifies the input CSV file path for the labeled image data (for training).
    • The original implementation downloads the training data on the fly; this is changed to read from the CSV file specified by the argument. The image data is obtained by base64-decoding the image column.
    • The --model argument specifies the output file path for the trained model.
  • Prediction process (predict.py)
    • The dataset argument specifies the input CSV file path for the labeled image data (for prediction).
    • The original implementation downloads the dataset on the fly; this is changed to read from the CSV file specified by the argument. The image data is obtained by base64-decoding the image column.
    • The --model argument specifies the input file path for the trained model.
    • An additional positional argument specifies the CSV file path to which the prediction results are output.
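The CSV reading with base64 decoding described above can be sketched as follows; the function name is illustrative, and the completed code in the attachment (mnist_module.zip/src/mnist/) is the reference:

```python
import base64
import csv

def load_labeled_images(csv_path):
    """Read a labeled image CSV (image_id, label, image columns) and
    return (image_bytes, label) pairs; the image column holds
    base64-encoded image binary data."""
    samples = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            samples.append((base64.b64decode(row["image"]), int(row["label"])))
    return samples
```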

Creating Entry Points (train.sh, predict.sh)

Next, create a shell script to execute train.py and predict.py. In the shell script, pass specific values (file paths) to the arguments defined in train.py and predict.py.

train.sh (completed)

#!/bin/sh

mkdir "/work/runs/output/models"

SCRIPT_FILE="/work/src/mnist/train.py"
MODEL_FILE="/work/runs/output/models/mnist.pth"
MODEL_META_FILE="/work/runs/output/models/mnist.xdata-meta.json"
DATASET_FILE="/work/runs/input/train_dataset/dataset.csv"

python "$SCRIPT_FILE" --model "$MODEL_FILE" --epoch 2 "$DATASET_FILE"

echo '{"model_kind": "mnist", "model_state": 10, "round": 1}' > "$MODEL_META_FILE"

predict.sh (completed)

#!/bin/sh

mkdir "/work/runs/output/predict_result"

SCRIPT_FILE="/work/src/mnist/predict.py"
MODEL_FILE="/work/runs/input/models/mnist.pth"
DATASET_FILE="/work/runs/input/predict_dataset/dataset.csv"
RESULT_FILE="/work/runs/output/predict_result/result.csv"

python "$SCRIPT_FILE" --model "$MODEL_FILE" "$DATASET_FILE" "$RESULT_FILE"

Operation check

Next, test the created training/prediction processing in a local environment. First, run

pip install -r requirements.txt

to install the packages required by your Python environment. (Use a virtual environment such as venv if necessary.)

To check the training process, first place the labeled image data CSV (for training) at the file path specified by DATASET_FILE in train.sh. The CSV file can be the file [training process execution folder]/input/train_dataset/dataset.csv generated during the operation check in the previous section. Then execute the shell script:

train.sh

If the model file is generated in the location specified by MODEL_FILE in the same shell script, it is a success.

To check the prediction process, first place the labeled image data CSV (for prediction) at the file path specified by DATASET_FILE in predict.sh. The CSV file can be the file [prediction process execution folder]/input/predict_dataset/dataset.csv generated during the operation check in the previous section. Also move the model file output by the training process to the path specified by MODEL_FILE in predict.sh. Then execute the shell script:

predict.sh

If the prediction results are generated at the location specified in RESULT_FILE of the same shell script, then it is successful.

5. Containerize the main body of the process

Next, we make the training and prediction processing implemented in the previous section run as Docker containers. In Dockerfile_train and Dockerfile_predict, add instructions to install the packages required for the training and prediction processing.

Dockerfile_train (completed)

FROM python:3

COPY ./src/mnist/requirements.txt /tmp/

RUN pip install --upgrade pip
RUN pip install --upgrade setuptools
RUN pip install --no-cache-dir -r /tmp/requirements.txt

RUN rm -f /tmp/requirements.txt

WORKDIR /work
COPY ./src /work/src

RUN chmod +x /work/src/train.sh

Dockerfile_predict (completed)

FROM python:3

COPY ./src/mnist/requirements.txt /tmp/

RUN pip install --upgrade pip
RUN pip install --upgrade setuptools
RUN pip install --no-cache-dir -r /tmp/requirements.txt

RUN rm -f /tmp/requirements.txt

WORKDIR /work
COPY ./src /work/src

RUN chmod +x /work/src/predict.sh

(Supplemental) requirements.txt alone is copied first, separately from the rest of src, to install the packages. This way, when rebuilding after changes under src, the package installation step can be served from the Docker build cache, reducing build time.

Next, run

docker compose build

If there are no build errors, the module is ready to run.

6. Run the module via the Dataproc API

Finally, run the implemented functionality as a module via the Dataproc API. Since all preparation is already complete, run the sample client (mnist_module.zip/samples/sample_client_2.py). If it completes without error, execute the following:

curl \
-H "Content-type:application/json" \
-H "APIKey: a4e16fab-9077-4a13-a7d5-0f1fda63cd16" \
-H "Secret: Do8feigevo8OhCi7OchoSan8zid7Fie3Aez4" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "fl.get_models",
  "params": {
    "model_store_ddc":"ddc:mnist_models"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/fl/jsonrpc

curl \
-H "Content-type:application/json" \
-H "APIKey: 0df23f0d-44cb-4085-8ac9-158e6ced3056" \
-H "Secret: 4AzbY3pvfes33aow49nhjn3aUphv9zFWctcT" \
-X POST -d \
'
{
  "jsonrpc": "2.0",
  "method": "prov.get_ddc_records",
  "params": {
    "ddc_label": "ddc:mnist_predict_result"
  },
  "id": "mnist_module"
}
' \
http://localhost/api/v1/provenance/jsonrpc

By executing these, you can verify that

  • a model has been added to the model store (output of train)
  • a record has been added to the prediction results (output of predict).

Unlike the operation checks along the way, these are an actual model and actual prediction results, not dummies.

This concludes the module development flow based on the Basic MNIST Example. You can develop your own modules by replacing the processing body and the input/output settings according to the required functionality.