
Do You Speak AI?

Have you ever wondered how we deal with AI projects?

What is the secret of building software that keeps scaling and growing with your organization? How does it all work, and why is machine learning such an important topic in custom software development today? Complex software solutions can get complicated quickly, and communication is key.

Let’s take a deep dive into the terminology we use for successful communication between developers, engineers, data scientists, and clients on AI projects.

What are the most used terms in AI projects?

  1. Sample and labeled sample
  2. Feature
  3. Train and test set
  4. Training phase and predictive phase
  5. Supervised and unsupervised learning

1. Sample and labeled sample

Data is a great start, but what is crucial in AI projects is how the collected data samples can be structured, labeled, and prepared for use in machine learning algorithms. Data samples are normally collected into datasets. Each data sample can be labeled with an additional description (a class) or with numerical values representing the targets we are trying to model as outputs.

Example: If we want to sort pictures of apples and pears into two separate directories, we have to develop a classifier that detects the objects in the pictures. The samples are collected in a dataset, where each data sample (a picture of an apple or a pear) is additionally annotated with a label describing the object in the picture.
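In code, a labeled dataset like the one above can be sketched simply as a list of (sample, label) pairs. The file names below are hypothetical, used only for illustration:

```python
from collections import Counter

# A minimal sketch of a labeled dataset: each sample (here, a
# hypothetical image file path) is paired with a class label.
labeled_samples = [
    ("images/img_001.jpg", "apple"),
    ("images/img_002.jpg", "pear"),
    ("images/img_003.jpg", "apple"),
    ("images/img_004.jpg", "pear"),
]

# Counting samples per label is a quick sanity check before training.
label_counts = Counter(label for _, label in labeled_samples)
```

Checking how many samples each class has is a useful first step, since a heavily imbalanced dataset can already hint at a suboptimal classifier down the line.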

Data samples in datasets are one of the most important things we need before we start developing a classifier or regression model with machine learning. There’s a saying: "Your machine learning models are only as good as your data." In real life we always deal with insufficient or missing information, and in AI projects this lack of data can lead to suboptimal classifiers or regression models.

2. Feature

Each data sample in the dataset has its own characteristics. When talking about telling apples and pears apart, we should ask ourselves how humans recognize apples and pears, and what the main differences are that would help us with classification. First, there are colors: apples are green, red, or yellow, while pears are normally green or yellow. If we tried to classify pears and apples by color alone, we wouldn’t be very accurate, since the colors can be similar.

So what is another characteristic that people use to distinguish an apple from a pear? Shape, of course! Pears have a "pear" shape and apples are more rounded. With such a process, what we extract from each data sample - a picture, in our example - are the common characteristics of each classification class. Each picture is additionally described by the color and shape of the object. Such characteristics are called features.

Features are then transformed into computer-friendly encodings that machine learning algorithms can easily process, normally joined into feature vectors. In other words, features are computer-friendly representations of the characteristics that describe a specific piece of information.
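As a rough sketch of this encoding step, the color and shape characteristics from the fruit example can be mapped to numbers and joined into a feature vector. The code mappings below are our own illustrative choice, not a standard encoding:

```python
# Illustrative, hand-picked encodings for two characteristics.
COLOR_CODES = {"green": 0, "yellow": 1, "red": 2}
SHAPE_CODES = {"round": 0, "pear-shaped": 1}

def to_feature_vector(color, shape):
    """Encode one sample's characteristics as [color_code, shape_code]."""
    return [COLOR_CODES[color], SHAPE_CODES[shape]]

apple_features = to_feature_vector("red", "round")
pear_features = to_feature_vector("green", "pear-shaped")
```

In real projects, feature extraction is usually automated (for images, often learned by the model itself), but the idea is the same: every sample becomes a numeric vector the algorithm can process.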

The feature extraction process is normally an algorithm or set of algorithms that automatically extracts relevant information from data samples. Usually, these algorithms are developed by machine learning engineers or data scientists, with the help of expert knowledge about the process we want to model.

3. Train and test set

Other essential terms in machine learning are the train and test sets. Both sets contain samples with concrete labels of what each sample represents. We normally strive to have as many labeled samples as possible, which we then divide into train and test sets by a defined proportion. The train set is used to train classifiers or regression models; the test set is then used to validate the trained models.

When splitting data into train and test sets, we have to be careful that samples in the train set are not also included in the test set. Another important thing is the proportion of classes in the train and test sets. For example, if we train a very good classifier capable of recognizing pears and apples in pictures but validate it with a test set containing only pears, we cannot say it is really capable of classifying apples as well.
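A minimal sketch of such a split, using only the standard library (real projects typically use a library helper such as scikit-learn's train_test_split; the data below is made up):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle the labeled samples and split them into disjoint
    train and test sets by the given proportion."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled samples: (feature_vector, label) pairs.
data = [([i, i % 2], "apple" if i % 2 == 0 else "pear") for i in range(10)]
train_set, test_set = split_dataset(data)

# Every sample ends up in exactly one of the two sets.
assert len(train_set) + len(test_set) == len(data)
```

Note that shuffling before cutting keeps the two sets disjoint; for skewed class proportions, a stratified split (keeping the class ratio equal in both sets) is usually the safer choice.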

We already mentioned that models can only be as good as our data. If we used pictures of cars to train a classifier meant to recognize apples and pears, the input images of pears and apples would be classified randomly. Such a classifier would obviously offer only a blind guess as to whether the object in the picture is an apple or a pear.

Whether we use supervised or unsupervised learning, the data used in the training phase should be representative of the domain we want to model. To summarize the terms train set and test set: we train a classifier or regression model with machine learning algorithms on the train set and later validate its performance on the test set.

4. Training phase and predictive phase

There are always two phases in a machine learning project. Machine learning is a process in which we train classifiers or regression models to predict future values or to classify an event or data sample. Such models are normally trained for specific tasks in real-time prediction or classification applications. Using these models in a production environment is called the predictive phase, while the process of developing them is called the training phase.

We normally want to use all of the available data when developing a production-ready model, so in the last iteration we join the train and test sets back together and train the final model on all of the data. The train and test sets mentioned earlier are needed only for selecting the best-performing algorithm and estimating the best training parameters. In the predictive phase, the developed model or algorithm is taken into production, where we take data from real processes, derive feature values or feature vectors, and use them as input for our predictive model.

We know this may sound a bit complicated, so let’s look at the training and predictive phases through an example of classifying apples and pears. (It works with any fruit.)

Example: First, we need a dataset of photos of apples and pears. In the training phase, the dataset is labeled, split, and used for training and testing. This is needed to train a model with a machine learning algorithm to decide which is which, pick the best algorithm for our classification task, and estimate the best algorithm parameters. Before we train our final model, we join the train and test sets back together and train the classifier with the best estimated parameters on all of the labeled data. Now, all we need is a computer with specialized software for serving the developed model, a camera, and real fruit to be sorted into two different baskets. If we put an apple in front of the camera, the trained model should label the object in the picture with one of the two labels, the apple class or the pear class. In such a case, we hope the model labels our apple as an apple. Voila!
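The two phases above can be sketched with a deliberately tiny nearest-centroid classifier. The 2-D feature vectors (a color code and a roundness score) and their values are hypothetical; a real image classifier would of course be far more complex:

```python
def train(samples):
    """Training phase: compute one centroid per class from labeled data."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Predictive phase: assign the class whose centroid is closest."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical labeled feature vectors: [color_code, roundness].
train_set = [([2.0, 0.9], "apple"), ([1.8, 1.0], "apple"),
             ([0.2, 0.1], "pear"), ([0.4, 0.2], "pear")]

model = train(train_set)          # training phase
guess = predict(model, [1.9, 0.8])  # predictive phase, on new data
```

The key point is the separation: `train` runs once during development, while `predict` runs repeatedly in production on freshly derived feature vectors.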

5. Supervised and unsupervised learning

In machine learning, there are two different approaches to solving problems.

With the supervised approach to training machine learning models, we want relatively complex algorithms to associate our input data (features calculated from data samples) with the labels of those data samples (also known as targets).

Example: Using our example of apples and pears once again, we try to find a function that maps input data representing apples to the label "apple" and input data representing pears to the label "pear". Since we train our model with a labeled dataset, we are searching for the mapping function in a supervised way, so that it classifies most of the labeled apples and pears correctly.

Unsupervised learning covers algorithms that learn patterns from unlabeled data. With unsupervised machine learning, we can be very successful at detecting anomalies or clustering data into groups. Unsupervised learning works effectively with large amounts of data and can take much longer to process, but on the other hand, we save the considerable time it takes to label a dataset. The main trade-off is that with unsupervised learning, the accuracy is usually much lower.

Every real-life application of machine learning has a different amount of labeled and unlabeled data available for training. Recognizing both methodologies helps us understand how important labeled data is in the development of machine learning models and their validation.

The takeaway

We hope you’ve learned something new today! Understanding the basic terminology when developing a machine learning application is crucial for successful results and well-executed projects. We also emphasized the importance of datasets when we’re developing the most accurate machine learning models, and shed light on the difference between supervised and unsupervised learning.

So, what’s your (next) machine learning project going to be?
