Skip to content

Index

Data preparation is essential for all aspects of GenAI. It spans from pre-training to fine-tuning to retrieval-augmented generation, embodying a multifaceted process that ensures the data is optimized for AI training and application. The journey of data preparation involves several critical steps, each tailored to enhance the quality and effectiveness of the data used.

Below is a clickable overview of the data preparation process:

graph LR
    style A fill:#b8e2f2,stroke:#1e81b0,stroke-width:2px
    style B fill:#f9edbe,stroke:#f0b429,stroke-width:2px
    style C fill:#f4b6c2,stroke:#ec407a,stroke-width:2px
    style D fill:#88d8b0,stroke:#26a69a,stroke-width:2px
    style E fill:#80cbc4,stroke:#00897b,stroke-width:2px
    style F fill:#ffcc80,stroke:#ffa726,stroke-width:2px


    A(Data Identification) --> B(Data Collection)
    A --> E(Simulation)
    B --> C(Data Selection and Filtering)
    C --> D(Data Augmentation)
    D --> DD(Formatting)
    DD --> F{Use}
    E --> F

    click A href "../gathering/index.html" "Data Identification"
    click B href "../gathering/scraping.html" "Data Collection"
    click C href "../preparation/selection.html" "Data Selection and Filtering"
    click D href "../augmentation/index.html" "Data Augmentation"
    click DD href "formatting.html" "Formatting"
    click E href "simulation.html" "Data Simulation"

Data Preparation Steps:

  1. Data Identification: The initial stage involves identifying the relevant data sources and types that are essential for training the AI model. This step is foundational, setting the stage for subsequent processes.

  2. Data Collection: Once the necessary data sources are identified, the next step is to collect data from these sources. This phase ensures a robust dataset that reflects the diversity and complexity of real-world scenarios. Learn more.

  3. Data Selection and Filtering: After collecting a substantial dataset, the selection and filtering process begins. This step involves refining the dataset, removing irrelevant, redundant, or low-quality data to ensure the efficiency and effectiveness of the training process. Learn more.

  4. Data Augmentation: To further enhance the dataset, data augmentation techniques are applied. This involves generating new data points from existing ones through various transformations, thereby increasing the quality, diversity and volume of the training data. Learn more.

  5. Formatting: Rewrites the data in a manner that is reduces token-usage or enables better results. Learn more

Following the links provided in the diagram, you can explore each component of the data preparation process in greater detail.