K-Content News

Collecting Meaningful Data: CrowdWorks, a Data Collection Platform for Deep-Learning
  • December 26, 2019

Many experts say that artificial intelligence (AI) will drive the changes that shape the world's future. AI is built mostly on data, which means that it can only function properly when a large amount of meaningful data is available. CrowdWorks is a platform for deep learning training that helps companies collect the data they need.

(Photo 1)
By reporter Kim Tae-hwan, Money Today Network, kimthin@mtn.co.kr

The role of AI is expanding. AI helps humans focus on creative work by completing tasks that are too simple or too difficult for them. In order for AI systems to perform properly, they must be backed by high-quality data. Since AI learns and computes based on data, the more abundant and reliable the data, the more exact the results.

AI completes computations by mimicking different ways of human thinking. Computers are trained to check information and to study for themselves, much like humans; this method of learning is called machine learning. Computers are trained to learn with data given to them by humans. AlphaGo, a type of Go-playing AI, learned through machine learning in its early stage of development.

Deep learning is a more advanced form of machine learning. In the deep learning stage, AI learns through data and analyzes the data without any input from humans. For example, AI equipped with deep learning can “look at” tens of thousands of pictures and decide whether any of the pictures are of a dog. In its early stage, the AI system is given pictures of dogs by humans to learn the characteristics of a dog.

Korea’s First Data Crowdsourcing Platform

Highly accurate data is necessary for AI to perform deep learning efficiently. Returning to the previous example, if AI is trained using pictures of dogs, it learns more effectively when given a larger number of such pictures. Efficiency improves further when the pictures are classified in advance into pictures of dogs and pictures of other things, that is, into right and wrong answers. When pictures are classified in this way, the AI system can distinguish between right and wrong answers more quickly, which increases the system's computation speed. In other words, the more data that has been classified in advance, the better for the AI system.

In years past, companies typically employed short-term, part-time workers to complete the classification of data for the development of AI systems. However, this caused several problems. Many of these part-time workers were college students who worked during their school vacation. By the time they became skilled at their work (usually after around three months), they had to return to school. Companies had to repeatedly hire new part-time employees, which led to increased management costs.

CrowdWorks introduced its platform to address these management costs and the difficulty of managing worker performance. The platform is a system that connects companies in need of data with people who can provide the data. For example, if Company A is developing an AI system that can distinguish between dogs and cats and needs images of dogs, the company can recruit workers on the CrowdWorks website to collect images of dogs. Member workers can upload images of dogs and receive points for each image.

This type of platform is known as a “data crowdsourcing platform.” Amazon, one of the world's largest IT companies, launched “Mechanical Turk,” which was the world's first platform to provide these kinds of services. CrowdWorks was the first such platform to be launched in Korea.

This crowdsourcing method can be used not only for simple data collection (such as for the collection of animal images) but also for the collection of advanced data. For example, one CrowdWorks project involved the “collection of the sound of babies crying.”

New parents often do not know why their babies are crying. To solve this problem, CrowdWorks recruited data collectors who had infants less than six months old. The recruited workers were given devices to record the sounds of their babies crying and were asked to indicate the reason why their babies were crying. As more crying sounds were collected, the AI created to determine the reason for the crying became more accurate.

(Photo 2)

Accuracy of Korean AI: 70%

Data collection performed using the CrowdWorks platform has also been used to develop AI to block obscene messages or words used in online chatting or telephone counseling. Sometimes inappropriate messages are not conveyed directly using obscene words, but are more subtle. Recognizing this, CrowdWorks collected various types of obscene messages and used them to create AI that could block obscene messages.

On CrowdWorks, texts constitute the largest category of AI data, and there are various projects dealing with texts involving the collection of QA sets, production of summaries, and analysis of morphemes.

All collected data goes through an inspection process. Rewards are not paid for submissions that do not pass inspection. According to CrowdWorks, this inspection process raises the reliability of the submitted data to as high as 99%. In contrast, Amazon’s Mechanical Turk does not have an inspection process. The inspection process needs to be performed by humans because the accuracy of the Korean AI used by CrowdWorks is relatively low.
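
The inspection rule described above can be sketched in a few lines of Python. This is only an illustration, not CrowdWorks' actual system: the `Submission` record, the `settle_rewards` and `pass_rate` functions, and the figure of 10 points per item are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    worker: str              # hypothetical worker ID
    passed_inspection: bool  # verdict given by a human inspector


def settle_rewards(submissions, points_per_item=10):
    """Return points per worker; rejected submissions earn nothing.

    The value of 10 points per item is an assumption for illustration.
    """
    rewards = {}
    for s in submissions:
        rewards.setdefault(s.worker, 0)
        if s.passed_inspection:
            rewards[s.worker] += points_per_item
    return rewards


def pass_rate(submissions):
    """Share of submissions that passed inspection (0.0 if there are none)."""
    if not submissions:
        return 0.0
    return sum(s.passed_inspection for s in submissions) / len(submissions)
```

A worker with one accepted and one rejected submission would be credited for the accepted item only, while `pass_rate` gives a rough picture of how much of the submitted work survives inspection.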

Son Yu-i, a manager at CrowdWorks, said, “Currently, Korean AI can do tagging and labeling work, but it is only about 70% accurate. Humans still have to inspect the work.” She added, “In the past, Korea’s AI technology was at an infant level, but it has now developed to a middle or high school level.”

Son also commented, “Low-level AI technology requires high-quality data.” She continued, “Sadly, Korea’s AI technology lags behind that of the US and Europe, and we can only develop comparable services when we significantly raise the quality of our data.”

In order to raise the quality of the data, suitable data collection personnel must be recruited for each unique project. For example, doctors must be recruited to develop AI services that can read CT images to find tumors. For such projects, CrowdWorks separately recruits personnel working in the required field(s), and the recruited personnel are paid higher rewards than for general tasks.

(Photos 3, 4, and 5)

Screenshots of CrowdWorks’ platform app

AI and Content

CrowdWorks, with the support of the Artificial Intelligence Research Institute (AIRI) and the Korea Creative Content Agency (KOCCA), manages the Intelligent Character Work and Service Model Development Project. This project aims to develop an “intelligent avatar” that can make certain judgments about a person and can react by expressing proper emotions. For example, a home AI robot should be able to make certain judgments about a child who is crying and be able to communicate and empathize with the child. If the AI robot gives too wide a smile to a crying child, it can have the opposite effect of making the child even more upset.

An AI avatar follows three stages of processing: input, selection, and output. An AI avatar makes judgments based on a person’s emotions and the situation it is presented with, “inputs” related factors, “selects” an action to take, and then “outputs” the action. A huge amount of data must be collected to realize all the processes involved in AI. First, facial data must be collected according to various criteria, such as age and gender. Gestures that a person makes while speaking must also be analyzed so that the AI avatar can make accurate judgments about the person’s emotions. Additionally, if an AI robot uses gestures while speaking, it can make the robot seem more human.
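
The three stages of input, selection, and output can be pictured as three small functions chained together. The sketch below is purely illustrative: the function names, the `RESPONSES` table, and the emotion labels are assumptions made for this example and do not describe the project's actual design.

```python
# Hypothetical mapping from a perceived emotion to a suitable reaction.
RESPONSES = {
    "crying": "gentle concern",
    "happy": "wide smile",
}


def perceive(observation):
    """Input: reduce what the avatar observes to the factors it judges on."""
    return {
        "emotion": observation["emotion"],
        "situation": observation.get("situation", "unknown"),
    }


def select_action(factors):
    """Selection: choose a reaction that fits the perceived emotion."""
    return RESPONSES.get(factors["emotion"], "neutral attention")


def render(action):
    """Output: express the chosen reaction."""
    return f"avatar expresses: {action}"


def respond(observation):
    """Run the three stages in order: input, selection, output."""
    return render(select_action(perceive(observation)))
```

Calling `respond({"emotion": "crying"})` selects a gentle reaction rather than a wide smile, which is the kind of judgment the crying-child example calls for. In a real system, each stage would be driven by models trained on the collected facial and gesture data rather than by a fixed lookup table.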

Intelligent avatars that can communicate with humans are expected to be very useful in the content field. They can be used to create an AI idol group or cyber entertainers that look very similar to humans.

Kim Ji-sun, Director of the Project Development Team at CrowdWorks, commented, “It is good for us to have this new experience of managing the KOCCA’s R&D project.” This large-scale project involves the collection of a total of 20,000 pieces of facial data from 2,000 data providers. The project is the largest ever undertaken by the company and is expected to contribute significantly to the company’s AI technology development.

INTERVIEW
Making Images that Do Not Exist Anywhere Else in the World

Kim Ji-sun, Director of the Project Development Team at CrowdWorks

How much participation is there on the CrowdWorks platform?

There are about 30,000 workers on the platform. When there is a simple project that must be completed within 24 hours, about 1,000 people instantly apply for the project. As of October 2019, 560 projects were registered on the platform, and we had 80 corporate customers that needed data.

How is duplicate data managed?

We block photos from being uploaded from an album. This helps minimize duplicate photos and ensures that the workers take the photos themselves. Data collected by each person is also inspected together on a single page so that duplicate data can be filtered out easily. We always try to come up with functions that can prevent workers’ carelessness and mistakes. Our company has applied for 34 patents related to the management of workers and inspectors, and five of them have been registered.
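
One simple way to picture automated duplicate filtering is to fingerprint each uploaded file and reject any upload whose fingerprint has been seen before. The Python sketch below is an assumption for illustration only, not the method CrowdWorks describes: `content_fingerprint` and `filter_duplicates` are hypothetical names, and a SHA-256 hash catches only byte-for-byte identical files, not re-shot or edited copies of the same scene.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Fingerprint a file by its exact bytes using SHA-256."""
    return hashlib.sha256(data).hexdigest()


def filter_duplicates(uploads):
    """Split (worker, file_bytes) pairs into first-seen files and repeats."""
    seen = set()
    unique, duplicates = [], []
    for worker, data in uploads:
        fingerprint = content_fingerprint(data)
        if fingerprint in seen:
            duplicates.append((worker, data))
        else:
            seen.add(fingerprint)
            unique.append((worker, data))
    return unique, duplicates
```

Because exact hashing misses near-duplicates, a check like this would complement rather than replace the human inspection described in the answer above.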

What kinds of companies need data?

Virtually all companies and schools that handle AI are CrowdWorks customers. We have worked for IT companies such as Naver and Kakao, telecom companies such as SK Telecom and KT, and credit card companies such as Hyundai Card. We also conduct various data projects with universities, such as Seoul National University and the Korea Advanced Institute of Science and Technology (KAIST).

What does the joint project with the KOCCA entail?

When we work with companies, we usually go no further than the data preprocessing stage. We have few opportunities to see how the data we generated is used. The KOCCA’s R&D project allows us to see how the data we generated is used, because it is managed jointly with the Artificial Intelligence Research Institute (AIRI), a supervising agency. The project has been very helpful and has given us new motivation.

Are there any AI technologies that can be applied in the content area?

There is a type of AI technology called generative adversarial networks (GANs) that can be used to combine existing data to create something new. For example, this technology can be used to create an image, such as a “talking Mona Lisa,” that does not exist in the real world. In the content area, it can be used to create works that have never existed before.

What kind of policy support is needed to stimulate AI research?

In order to promote deep learning, the government should ease data regulations. Currently in Korea, the Personal Information Protection Act is so stringent that data use is very limited. In contrast, Japan amended its Copyright Act in 2017 to allow the freer use of data for research purposes. AI research can be stimulated and advanced by the freer use of data in Korea.