Machine learning (ML) is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. ML focuses on developing computer programs that can access data and use it to learn for themselves. One of the most crucial things ML depends on is data.
At Dataleon, we know that data preparation is a very important step in the machine learning process. It is the set of procedures that get data ready for training, testing and implementing an algorithm. This multi-step process involves data collection, cleaning, validation, transformation and labeling.
The process of data preparation starts with searching for the right data. It involves collecting the data that is believed to be useful in making a prediction and clearly defining the form the prediction will take. It may also involve talking to project managers and other people with deep expertise in the domain. A deep understanding of our customer's needs determines the data we will later use for ML.
When gathering data, the main problems we face are a lack of data, poor-quality data and unbalanced data. To solve these problems, Dataleon's experts use Scenes Editor, an interface for data generation. The generated data can later be used for labeling.
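To make "unbalanced data" concrete, here is a minimal sketch of how class imbalance could be detected before deciding whether more data needs to be generated. The function names, the `max_ratio` threshold and the sample labels are illustrative assumptions, not part of Scenes Editor or any Dataleon tool, and the sketch assumes a non-empty list of labels.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def is_unbalanced(labels, max_ratio=3.0):
    """Flag the dataset when the largest class outnumbers the smallest by more than max_ratio."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()) > max_ratio

# Hypothetical sample: 8 invoices and 2 receipts.
labels = ["invoice"] * 8 + ["receipt"] * 2
shares = class_balance(labels)
unbalanced = is_unbalanced(labels)
```

The threshold of 3.0 is arbitrary; a suitable value depends on the task and the model.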
Data cleaning is the next step in data preparation. At this step we remove all the data that does not belong in the dataset. This process involves fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
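The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries with hashable values, "incomplete" means a required field is missing or empty, and the field names `id` and `text` are hypothetical.

```python
def clean_records(records, required_fields):
    """Drop incomplete and duplicate records and normalize string whitespace."""
    seen = set()
    cleaned = []
    for record in records:
        # Fix formatting: strip surrounding whitespace from string values.
        normalized = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Incomplete: a required field is missing, None, or empty.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Duplicate: same field values as a record that was already kept.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned

raw = [
    {"id": 1, "text": " Invoice 2021 "},
    {"id": 1, "text": "Invoice 2021"},  # duplicate once whitespace is stripped
    {"id": 2, "text": ""},              # incomplete
    {"id": 3, "text": "Receipt"},
]
cleaned = clean_records(raw, required_fields=["id", "text"])
```

Normalizing before deduplicating matters here: the first two records only become identical after the whitespace is removed.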
Data validation takes place after the data has been cleaned. At this step we check the data for:
- Validity. The degree to which the data conforms to defined business rules or constraints.
- Accuracy. Ensure that your data is close to the true values.
- Completeness. The degree to which all required data is present.
- Consistency. Ensure that your data is consistent within the same dataset and/or across multiple datasets.
- Uniformity. The degree to which the data is specified using the same unit of measure.
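Several of the checks above can be expressed as simple rules. The following is a minimal sketch, assuming hypothetical records with `id`, `weight` and `unit` fields and made-up business rules (weights must be positive and given in kilograms). Accuracy is left out because it can only be checked against a trusted reference source.

```python
ALLOWED_UNITS = {"kg"}  # hypothetical rule: all weights must be in kilograms

def validate(records):
    """Return (record index, problem) pairs; an empty list means every check passed."""
    problems = []
    first_seen = {}
    for i, rec in enumerate(records):
        # Completeness: every required field is present and not None.
        missing = [f for f in ("id", "weight", "unit") if rec.get(f) is None]
        if missing:
            problems.append((i, "incomplete: missing " + ", ".join(missing)))
            continue
        # Validity: the value satisfies a business rule.
        if rec["weight"] <= 0:
            problems.append((i, "invalid: weight must be positive"))
        # Uniformity: every record uses the same unit of measure.
        if rec["unit"] not in ALLOWED_UNITS:
            problems.append((i, "non-uniform unit: " + rec["unit"]))
        # Consistency: the same id must not carry conflicting values.
        previous = first_seen.setdefault(rec["id"], rec)
        if previous is not rec and previous != rec:
            problems.append((i, "inconsistent with earlier record for id " + str(rec["id"])))
    return problems

sample = [
    {"id": 1, "weight": 2.5, "unit": "kg"},
    {"id": 2, "weight": -1.0, "unit": "kg"},
    {"id": 3, "weight": 4.0, "unit": "lb"},
    {"id": 1, "weight": 3.0, "unit": "kg"},
    {"id": 4, "weight": None, "unit": "kg"},
]
problems = validate(sample)
```

Returning a list of problems instead of raising on the first failure lets a reviewer see every issue in the dataset in one pass.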
At the data transformation stage, we convert data from one format or structure into another. This process is also referred to as data wrangling or data munging: transforming and mapping data from one “raw” form into another format for warehousing and analysis.
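As an illustration of such a transformation, the sketch below maps raw CSV text into typed records with a single unit of measure and then serializes them as JSON. The column names and the sample input are assumptions made for the example.

```python
import csv
import io
import json

LB_TO_KG = 0.45359237  # one international pound in kilograms

def transform(raw_csv):
    """Parse raw CSV text into typed records with weights in kilograms."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    result = []
    for row in rows:
        weight = float(row["weight"])
        # Map every weight onto one unit so downstream code can compare values.
        if row["unit"] == "lb":
            weight *= LB_TO_KG
        result.append({"id": int(row["id"]), "weight_kg": round(weight, 3)})
    return result

raw_csv = "id,weight,unit\n1,2.5,kg\n2,10,lb\n"
transformed = transform(raw_csv)
as_json = json.dumps(transformed)  # structured form, ready for storage or analysis
```

The input is untyped text in mixed units; the output has numeric types and one consistent unit.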
The final stage of the data preparation process is data labeling. In machine learning, this is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful and informative labels to provide context, so that a machine learning model can learn from it. Dataleon's experts use Labeling Editor for data labeling.
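To show what "adding one or more labels" to raw data can look like in practice, here is a minimal sketch of a labeled item and its conversion into training pairs. The `LabeledItem` class, the file names and the label values are hypothetical and do not describe how Labeling Editor stores its output.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledItem:
    """A raw data item plus the labels attached to it."""
    source: str  # path or identifier of the raw file
    labels: list = field(default_factory=list)

    def add_label(self, label):
        # Avoid storing the same label twice.
        if label not in self.labels:
            self.labels.append(label)

def to_training_pairs(items):
    """Emit one (source, label) pair per label; unlabeled items produce no pairs."""
    return [(item.source, label) for item in items for label in item.labels]

items = [
    LabeledItem("scan_001.png"),
    LabeledItem("scan_002.png"),
    LabeledItem("scan_003.png"),  # stays unlabeled
]
items[0].add_label("invoice")
items[1].add_label("receipt")
items[1].add_label("handwritten")
items[1].add_label("receipt")  # ignored, already present
pairs = to_training_pairs(items)
```

Unlabeled items are dropped from the training pairs, since a supervised model has nothing to learn from them.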
The Dataleon API can guide you through the whole process of data preparation. If you are interested in our services, just let us know.