One of the first questions raised by people who are considering AI to improve operational efficiency in their business is how much training data is required.

For example, for AI to perform tasks such as sorting complaints from customers who use the company’s products or services, or classifying product images, it is necessary to prepare training data and train the AI model through supervised learning.

However, the data used for training must be prepared by humans. Those taking on an AI project for the first time may well wonder how much data they need to collect.

The most time-consuming part of building an AI model is collecting and organizing the training data.

Gathering data scattered across various places, such as PCs and tablets in the office, databases of business systems that we do not usually see, and sometimes paper documents, and then processing it into a form that AI can understand takes a great deal of manual effort. It is a very analog process in service of digitizing the business with AI.

In most cases, more data than initially estimated turns out to be needed once model training is under way, and additional training data is added during the project. As a result, it is not always easy to quantitatively estimate the cost of implementing an AI model.


The most important factor in estimating the training data needed is the diversity of objects you want to detect in your AI model

There is one thing to consider in the early stages of an AI project: the diversity of the data that the AI must recognize.

Thinking about the diversity of the data lets you gauge how challenging the problem you want AI to solve is.

If the diversity is high, a large amount of training data is required. On the other hand, if the diversity is low, then only a small amount of training data is needed.


Consider data diversity using the example of a fictitious cleaning company project.

It is difficult to define data diversity in precise terms. It is, however, a concept that you get a feel for as you work with data and train AI models.

That said, actually going through the cycle of acquiring data, building a model, and improving its accuracy takes time, so let us look at a fictitious example instead.

Company A provides cleaning services for public facilities such as parks and parking lots. The company is building a system that automatically dispatches cleaners when the image processing AI detects a certain level of contamination in the surveillance camera footage of the facility.

In this case, the AI model input is the image data acquired from the surveillance camera at the site where Company A is contracted to do the cleaning work.

The output of the AI model is the degree of dirtiness, rated on a 10-point scale. If the rating exceeds 7, the system automatically dispatches a cleaner.
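The dispatch rule described above can be sketched in a few lines. This is only an illustration: the function name `should_dispatch` and the constant `DISPATCH_THRESHOLD` are hypothetical, and the rating is assumed to arrive as a plain number from the model.

```python
# Assumption: the model returns a numeric rating on the 10-point scale.
DISPATCH_THRESHOLD = 7  # the article's rule: dispatch when the rating exceeds 7


def should_dispatch(dirtiness: float, threshold: float = DISPATCH_THRESHOLD) -> bool:
    """Return True when the predicted rating is strictly above the threshold."""
    return dirtiness > threshold


if __name__ == "__main__":
    for rating in (3, 7, 9):
        print(rating, should_dispatch(rating))
```

Note that "exceeds 7" is read here as a strict comparison, so a rating of exactly 7 does not trigger a dispatch.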

Therefore, the training data would be a set of images of the cleaning site and a human-defined rating of the degree of dirtiness for that image.

Each image of a cleaning site would be labeled with the correct answer for the AI’s prediction, e.g., “This image has a degree of dirtiness of 6” and “This one has a degree of dirtiness of 3”. Even at the same site, the rating will differ depending on when the image was captured, so the training data should include multiple images for each site.

We wanted to include actual photos of sites here, but we could not find images of moderately dirty streets and parks. I hope you can imagine that an alley in Shibuya on a Monday morning would have a degree of dirtiness of about 9.
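As a rough sketch of what such training data might look like in code, the snippet below pairs each image with a human-assigned rating and checks that every site has more than one image. The class `LabeledImage`, its field names, the site IDs, and the file paths are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledImage:
    site_id: str       # hypothetical identifier of the cleaning site
    captured_at: str   # ISO 8601 timestamp of the capture
    image_path: str    # hypothetical path to the image file
    dirtiness: int     # human-assigned rating on the 10-point scale


# Illustrative records only; the values are invented for this sketch.
samples = [
    LabeledImage("park-01", "2024-04-01T08:00", "images/park-01_0800.jpg", 6),
    LabeledImage("park-01", "2024-04-01T17:00", "images/park-01_1700.jpg", 3),
    LabeledImage("lot-07", "2024-04-01T08:00", "images/lot-07_0800.jpg", 2),
]


def group_by_site(items):
    """Group labeled images by the site they were captured at."""
    by_site = defaultdict(list)
    for item in items:
        by_site[item.site_id].append(item)
    return dict(by_site)


def sites_with_single_image(items):
    """Return sites that still have fewer than two labeled images."""
    return sorted(
        site for site, group in group_by_site(items).items() if len(group) < 2
    )


if __name__ == "__main__":
    print(sites_with_single_image(samples))
```

A check like `sites_with_single_image` is one simple way to spot sites where the data does not yet capture how the rating changes over time.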

Here, we are finally talking about data diversity.

The AI model determines the degree of dirtiness from the dirt that appears in the image. However, AI does not understand the concept of dirt to begin with, so the training data is its only clue for recognizing dirt as dirt.

Here, I would like to take a moment to organize the kinds of dirt found at cleaning sites.

The first things that come to mind are dust on the ground and stains on railings and walls. Another kind of stain is graffiti that someone scrawled on a wall late at night. There is also trash, such as cans and cigarette butts scattered along the roadside. There are stains that grow naturally, such as moss and algae, and the list goes on.

Each kind of dirt can then be broken down further by size, shape, color, and location of appearance. For example, a stain like moss or algae may cover the entire floor, or it may cling to only one part of a wall or stone. It may be green, but it could also be brownish.

Furthermore, it may spread in a straight line or radiate outward in a circle.

The other kinds of dirt come in just as many patterns. All of this adds up to the diversity of the data.
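To get a feel for how quickly these patterns multiply, the sketch below enumerates every combination of a few attribute values. The attribute names follow the criteria in the text, but the specific values listed in `ATTRIBUTES` are assumptions chosen for illustration, not a real taxonomy.

```python
from itertools import product

# Assumed attribute values; a real project would define its own.
ATTRIBUTES = {
    "type": ["dust", "graffiti", "litter", "moss_algae"],
    "size": ["small", "large"],
    "shape": ["line", "radial", "patch"],
    "color": ["green", "brown", "other"],
    "location": ["floor", "wall", "railing"],
}


def enumerate_patterns(attributes):
    """Return every combination of attribute values as a list of dicts."""
    keys = list(attributes)
    return [dict(zip(keys, values)) for values in product(*attributes.values())]


patterns = enumerate_patterns(ATTRIBUTES)

if __name__ == "__main__":
    print(len(patterns))  # 4 * 2 * 3 * 3 * 3 = 216
    print(patterns[0])
```

Even with only a handful of values per attribute, the number of distinct patterns runs into the hundreds, although not every combination will occur in practice.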


AI can make accurate decisions if the data is similar to the training data.

Once training is complete and the AI receives unknown data that was not in the training data, it can make accurate judgments if similar data was present in the training images, but it is likely to make incorrect judgments on data that differs too much from them.

For example, suppose the training data in the previous example contains moss and algae stains of various sizes, shapes, colors, and locations. In this case, the AI model would be able to accurately determine the degree of contamination from moss and algae stains.

However, suppose there is almost no training data on graffiti. The AI model may judge graffiti to be a minor stain, and customers may complain that the graffiti is never removed no matter how long they wait. Conversely, the AI model may overreact to unimportant graffiti, such as letters scratched into the ground, and dispatch a cleaner.

Although this is a rather subjective measure, AI can make highly accurate judgments if there is enough data for each kind of object, broken down by criteria such as type, size, shape, color, and location of occurrence. It is difficult to say how much is enough, as it depends on the nature of the data and the performance of the AI model.

For example, it may be possible to evaluate brown moss and algae stains on the floor when the training data contains only green moss and algae stains on walls, but doing so requires more training data.
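One simple way to look for gaps of this kind is to count how many training samples fall into each category and flag the ones below a chosen minimum. The sketch below does that; the function `underrepresented`, the category tuples, and the `min_count` value are all hypothetical, and the threshold in particular is arbitrary because, as noted above, what counts as sufficient depends on the data and the model.

```python
from collections import Counter


def underrepresented(labels, expected, min_count):
    """Return {category: count} for expected categories with too few samples."""
    counts = Counter(labels)
    return {
        category: counts.get(category, 0)
        for category in expected
        if counts.get(category, 0) < min_count
    }


# Illustrative labels as (type, color, location) tuples; values are invented.
labels = [("moss_algae", "green", "wall")] * 5 + [("moss_algae", "brown", "floor")]

expected = [
    ("moss_algae", "green", "wall"),
    ("moss_algae", "brown", "floor"),
    ("graffiti", "any", "wall"),
]

gaps = underrepresented(labels, expected, min_count=3)

if __name__ == "__main__":
    for category, count in gaps.items():
        print(category, count)
```

In this made-up data set, green moss on walls is well covered, while brown moss on floors and graffiti are flagged as candidates for additional data collection.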


Often forgotten is the diversity of conditions under which data is acquired.

So far, we have summarized the diversity of the objects that we want to detect or that can serve as clues for making decisions, but one kind of diversity that is often forgotten is the diversity of conditions under which the data is acquired.

In the example of the cleaning company above, the cleaning site where the images are taken could be a park, an exterior wall of a public facility, a parking lot, a road, and so on.

The same amount of dirt might be judged as not so dirty in a park but as very dirty in a parking lot. The way dirt appears in the image is also highly likely to differ depending on the location. The AI’s judgment is further affected by objects that are not dirt, such as people, cars, or animals that happen to appear in the image.


If the data is highly diverse, test the effectiveness of AI by first narrowing the scope of the problem

In this way, we can see that our example AI project is one with high data diversity.

Therefore, instead of introducing the system at all sites from the beginning, it may be necessary to narrow the scope, for example by training the AI model only on parking lots, which account for many of the sites you handle.

Furthermore, if monitoring is limited to around noon, for example, the range of capture times that the training data must cover can be reduced.

Although the business benefits are limited at first, the idea is to focus on functions that can reach a certain level of accuracy and put them into actual operation.

While the system is in operation with limited functions, data for other sites will be collected, and the types of sites to be supported will be gradually expanded from parking lots to roads, parks, and so on.
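The narrowing of scope described in this section can be sketched as a simple filter over the collected records. Everything here is assumed for illustration: the record fields, the `in_scope` function, and the 11:00 to 13:00 window used as a stand-in for "noon only".

```python
from datetime import datetime

# Illustrative records only; field names and values are invented.
records = [
    {"site_type": "parking_lot", "captured_at": "2024-04-01T12:10:00", "dirtiness": 4},
    {"site_type": "parking_lot", "captured_at": "2024-04-01T20:30:00", "dirtiness": 8},
    {"site_type": "park", "captured_at": "2024-04-01T12:05:00", "dirtiness": 5},
]


def in_scope(record, site_types=("parking_lot",), start_hour=11, end_hour=13):
    """Keep records from the chosen site types captured within the hour window."""
    hour = datetime.fromisoformat(record["captured_at"]).hour
    return record["site_type"] in site_types and start_hour <= hour < end_hour


pilot = [r for r in records if in_scope(r)]

if __name__ == "__main__":
    print(len(pilot), "of", len(records), "records are in the pilot scope")
```

Expanding the rollout later then amounts to widening `site_types` or the hour window once enough data for the new conditions has been collected.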


Conclusion: it is important to sort out the diversity of data first


I have covered a lot, but the key point is that the first step is to sort out the diversity of your data.

However, it is difficult to know whether the data in front of you is data with high diversity or low diversity unless you are familiar with AI. In such cases, please feel free to contact ISID’s AITC. We will be happy to assist you based on examples of projects we have conducted in the past.