Data Labeling and Annotation in Supervised Machine Learning
- Python for Integrated Circuits -
- An Online Book -
Python for Integrated Circuits                                                                                   http://www.globalsino.com/ICs/        



=================================================================================

In supervised learning, a "label" refers to the output or target variable that the machine learning model is trained to predict. Supervised learning is a type of machine learning where the algorithm learns to make predictions or decisions based on input data, which is paired with corresponding output labels. These labels are also sometimes called "targets" or "ground truth."

Here's a breakdown of key terms in supervised learning:

  1. Input Features (X): These are the variables or attributes of the data that are used as input to the model. For example, if you're building a model to predict house prices, the input features could include factors like square footage, number of bedrooms, and location.

  2. Output Labels (Y): These are the values the model is trying to predict or classify. In the case of house price prediction, the output labels would be the actual sale prices of the houses in your dataset.

  3. Training Data: This is the dataset that the machine learning model uses to learn the relationship between the input features and the output labels. It consists of pairs of input-output examples, where each example includes input features and their corresponding output label.

  4. Supervised Learning: This is the learning paradigm where the model is trained on a labeled dataset, and its goal is to learn a mapping from input features (X) to output labels (Y) so that it can make accurate predictions on new, unseen data.

The training process involves adjusting the model's internal parameters to minimize the difference between its predictions and the actual labels in the training data. Once the model is trained, it can be used to make predictions on new data where the output labels are unknown.

For example, if you trained a supervised learning model to predict house prices, you could input the features of a new house (e.g., square footage, number of bedrooms) into the model, and it would provide a predicted house price based on the patterns it learned during training. The accuracy of these predictions is evaluated by comparing them to true, known house prices, typically on a held-out test set that was not used during training.
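The house-price example above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the book's own code: the feature values and prices below are hypothetical toy data, and a simple linear regression stands in for whatever model a real project would use.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: input features X (square footage, bedrooms)
# paired with output labels Y (known sale prices).
X_train = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y_train = [200000, 280000, 340000, 420000]

# Fit the model: learn a mapping from input features X to output labels Y
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of a new, unseen house from its features
new_house = [[1800, 3]]
predicted_price = model.predict(new_house)[0]
print(predicted_price)
```

The same fit/predict pattern applies regardless of the model: training pairs (X, Y) teach the mapping, and prediction is then run on feature vectors whose labels are unknown.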

Training datasets require several example predictor variables to classify or predict a response. In machine learning, the predictor variables are called features and the responses are called labels. Domain experts must inspect the analyzed data further to ensure that the ground-truth labeling is accurate, since labels can be wrong for several reasons, for instance errors in the labeling algorithm or insufficient annotator training.
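The feature/label terminology maps directly onto how a dataset is split in code. In this hedged sketch (the column names and values are hypothetical), the predictor columns become the features and the response column becomes the labels:

```python
import pandas as pd

# Hypothetical dataset: two predictor variables and one response
df = pd.DataFrame({
    'square_footage': [1000, 1500, 2000],
    'bedrooms': [2, 3, 3],
    'price': [200000, 280000, 340000],
})

# Predictor variables are the features; the response is the label
features = df[['square_footage', 'bedrooms']]
labels = df['price']
print(list(features.columns), labels.name)
```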

Vertex AI, Google Cloud's unified artificial intelligence platform, is a useful platform for building machine-learning templates; it offers an end-to-end ML solution, from model training to model deployment. It can ingest large datasets (more than 1000 GB) and is scalable.

Note that labeling text data requires significant human resources.

Vertex AI

Figure 4267. Vertex AI providing a unified set of APIs for the ML lifecycle. [1]

High-level overview of the proposed ML method in the publication

Figure 4105b. Overview of the proposed ML method in the publication. [2]

============================================

Text classification can be performed based on the values in ColumnA to predict the values for ColumnB. To achieve this, a text classification model is used below: a simple Multinomial Naive Bayes classifier from the sklearn library is trained on the rows of ColumnA and ColumnB, and the trained model is then used to predict the ColumnB value for a new string taken from the CSV file. Note that more complex scenarios need more advanced text classification techniques and more training data. Code (a minimal sketch; the filename 'data.csv' and the CSV layout, with the new string in the first row, are assumptions):

          import pandas as pd
          from sklearn.feature_extraction.text import CountVectorizer
          from sklearn.naive_bayes import MultinomialNB

          # Read the CSV; the first row holds the new string to classify
          df = pd.read_csv('data.csv', header=None, names=['ColumnA', 'ColumnB'])
          new_string = df['ColumnA'][0]

          # Convert the remaining rows of ColumnA into token-count feature vectors
          vectorizer = CountVectorizer()
          X_train = vectorizer.fit_transform(df['ColumnA'][1:])
          y_train = df['ColumnB'][1:]

          # Train the Multinomial Naive Bayes classifier and predict ColumnB
          clf = MultinomialNB()
          clf.fit(X_train, y_train)
          print(clf.predict(vectorizer.transform([new_string]))[0])

The code above implements the Multinomial Naive Bayes algorithm. In this code, labeling is represented by the following line:

          y_train = df['ColumnB'][1:]

In this line, the values from 'ColumnB' of the dataset (df) are extracted and assigned to the variable y_train. These values represent the labels, or target values, that the machine learning model is trained to predict.

In machine learning, labeling is the process of assigning the correct output value or category to each input data point. In this script, y_train contains the labels that correspond to the training examples in X_train, and they are used to train the Naive Bayes classifier (clf).
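The role of the labels can be seen in a tiny, self-contained illustration. The strings and label values below are hypothetical, and the data is built in memory rather than read from a CSV; after training, the classifier records the distinct label values it saw in y_train:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the CSV data
df = pd.DataFrame({
    'ColumnA': ['cheap small house', 'large luxury villa',
                'tiny cheap flat', 'luxury penthouse'],
    'ColumnB': ['low', 'high', 'low', 'high'],
})

# ColumnB supplies the labels; ColumnA is vectorized into features
y_train = df['ColumnB']
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(df['ColumnA'])

clf = MultinomialNB()
clf.fit(X_train, y_train)

# The fitted classifier stores the unique label values from y_train
print(list(clf.classes_))
```

Inspecting clf.classes_ is a quick sanity check that the labeling column contained exactly the categories you expected, with no typos or stray values.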

============================================

[1] Diagram courtesy Henry Tappen and Brian Kobashikawa.
[2] Dan Ofer, Machine Learning for Protein Function, thesis, 2018.

=================================================================================