
0.2. Datasets

What is a dataset?

A dataset is a structured collection of data that forms the foundation of any AI or Machine Learning (ML) project. While its format can vary, the dataset fundamentally defines the scope, potential, and challenges of your work.

A common observation in the industry is that data preparation (cleaning, exploring, and feature engineering) can consume up to 80% of an engineer's time, leaving only 20% for modeling. This intensive preparation is non-negotiable: the quality and size of a dataset often have a more profound impact on model performance than the choice of algorithm itself. This reality is captured by the fundamental principle of data science: "garbage in, garbage out."

When is a dataset used?

Datasets are central to every stage of the AI/ML project lifecycle:

  • Exploration: In this initial phase, you analyze the dataset to uncover insights, understand variable relationships, and identify patterns that can inform your modeling strategy.
  • Data Processing: Here, you transform raw data into a usable format by creating predictive features (feature engineering) and splitting the data into training, validation, and test sets (see the sketch after this list).
  • Model Tuning: The dataset is used to optimize model hyperparameters, often through cross-validation, to ensure the model generalizes well to new, unseen data.
  • Model Evaluation: Finally, you assess the model's performance on a reserved test set to measure its effectiveness and identify opportunities for improvement.
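
To make these stages concrete, here is a minimal sketch of the processing, tuning, and evaluation steps using scikit-learn on a synthetic dataset; the model and hyperparameter grid are illustrative choices, not recommendations from this course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Data Processing: generate a toy dataset and split it into train/test sets.
X, y = make_regression(n_samples=1_000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Tuning: optimize hyperparameters with 5-fold cross-validation.
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# Model Evaluation: assess the best model on the reserved test set.
y_pred = search.best_estimator_.predict(X_test)
print(f"Test MAE: {mean_absolute_error(y_test, y_pred):.3f}")
```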

What are the types of datasets?

Datasets are broadly classified into three categories, each presenting unique storage, processing, and modeling challenges.

Structured Data

This data adheres to a predefined schema, making it straightforward to organize, query, and analyze.

  • Tabular Data: Organized in rows and columns, like a spreadsheet or database table (e.g., CSV, Parquet). Each column has a specific data type (integer, string, etc.).
    • Use Case Example: A CSV file containing customer information, with columns for customer_id, purchase_date, and amount_spent (loaded in the sketch after this list).
  • Time Series Data: A sequence of data points recorded at consistent time intervals. The order is critical for analysis and forecasting.
    • Use Case Example: Daily stock prices or hourly temperature readings used to predict future values.
  • Geospatial Data: Contains location-based information, such as coordinates or postal codes, used for spatial analysis.
    • Use Case Example: Mapping customer locations to optimize delivery routes.
  • Graph Data: Represents entities as nodes and their relationships as edges. Ideal for modeling complex networks.
    • Use Case Example: A social network, where users are nodes and friendships are edges, used to recommend connections.
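
As a quick illustration of tabular data, the sketch below loads the customer CSV from the use case above with pandas; the file name and column types are assumptions made for this example.

```python
import pandas as pd

# Load the hypothetical customer CSV, declaring column types explicitly
# and parsing dates so the schema is enforced on read.
df = pd.read_csv(
    "customers.csv",  # assumed file name for this example
    dtype={"customer_id": "int64", "amount_spent": "float64"},
    parse_dates=["purchase_date"],
)
print(df.dtypes)
print(df.head())
```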

Unstructured Data

This data lacks a predefined data model or schema, making it more complex to process and analyze. It often requires specialized ML models.

  • Text: Any form of text data, from short tweets to long documents. Processed using Natural Language Processing (NLP) models.
    • Use Case Example: Analyzing customer reviews to determine sentiment (positive, negative, neutral), as sketched after this list.
  • Multimedia: Includes images, audio, and video files. Processed using Computer Vision and Audio Processing models.
    • Use Case Example: An image dataset of cats and dogs used to train an image classification model.
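
To give a flavor of how text data is handled, here is a minimal sentiment classification sketch using scikit-learn; the tiny inline review set and the linear model are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled reviews (a real project would use thousands of examples).
reviews = [
    "great product, loved it",
    "terrible, broke after a day",
    "works perfectly, very happy",
    "awful customer service",
]
labels = ["positive", "negative", "positive", "negative"]

# Vectorize the text with TF-IDF and fit a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["really happy with this purchase"]))
```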

Semi-Structured Data

This data doesn't fit a rigid schema but contains tags or markers to separate semantic elements, offering more flexibility than structured data.

  • Use Case Example: JSON or XML files, which use key-value pairs and tags to organize data hierarchically, are common for web APIs and configuration files.
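
As a small example, the sketch below parses a nested JSON payload with Python's standard library and flattens it into a table with pandas; the payload shape is made up for illustration.

```python
import json

import pandas as pd

# A hypothetical API payload: keys give structure without a rigid schema,
# and records are free to omit fields (Bob has no country).
payload = """
[
  {"id": 1, "user": {"name": "Alice", "country": "FR"}, "amount": 19.99},
  {"id": 2, "user": {"name": "Bob"}, "amount": 5.50}
]
"""

records = json.loads(payload)

# Flatten the nested "user" object into columns; missing keys become NaN.
df = pd.json_normalize(records)
print(df)
```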

What defines a high-quality dataset?

A high-quality dataset is a prerequisite for a successful ML model. Key attributes include:

  • Accuracy: The data correctly reflects the real-world phenomena it represents. Inaccurate data leads to flawed models.
  • Completeness: There are no missing values for critical features. Missing data can bias models or require complex imputation techniques (see the checks sketched after this list).
  • Consistency: The data is free of contradictions. For example, a customer's age should not decrease over time.
  • Relevance: The data contains features that are predictive of the target outcome. Irrelevant data adds noise and complexity.
  • Timeliness: The data is recent enough to be relevant to the problem. Outdated data can lead to models that perform poorly on current inputs.
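
Several of these attributes can be checked programmatically. The sketch below runs quick completeness, consistency, and accuracy checks with pandas on a hypothetical customer table; the column names and thresholds are assumptions.

```python
import pandas as pd

# Hypothetical customer data with deliberate quality issues.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 28, 28, None],
    "amount_spent": [19.99, 5.50, 5.50, -3.00],
})

# Completeness: count missing values per column.
print(df.isna().sum())

# Consistency: count fully duplicated records.
print(df.duplicated().sum())

# Accuracy: flag values outside a plausible range (negative spend here).
print(df[df["amount_spent"] < 0])
```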

Which dataset should you use?

When starting out, the best choice is the dataset you are most familiar with. The principles of MLOps are universal and apply across different data types and domains. By using a dataset you already understand, you can focus your energy on mastering the MLOps workflow rather than getting bogged down in the specifics of new data.

As mentioned in the previous section, this course defaults to the Bike Sharing Demand dataset. However, you are encouraged to apply the concepts learned here to any dataset relevant to your personal or professional goals.
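
If you follow the course default, here is a minimal loading sketch for the Bike Sharing Demand data; it assumes you have downloaded the competition's train.csv, which includes a datetime column, into your working directory.

```python
import pandas as pd

# Assumes train.csv was downloaded from the Kaggle "Bike Sharing Demand"
# competition into the working directory.
df = pd.read_csv("train.csv", parse_dates=["datetime"])

# Respect temporal order: a chronological split avoids leaking future
# observations into the training set.
df = df.sort_values("datetime")
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]
print(train.shape, test.shape)
```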