Skip to content

0.2. Datasets

What is a dataset?

A dataset is an organized collection of data, serving as the cornerstone for any AI or Machine Learning (ML) initiative. The structure of a dataset might vary, yet its pivotal role in shaping the scope, capabilities, and challenges of a project remains undisputed. Data preparation, involving substantial cleaning and exploration of raw data, often consumes the lion's share of a Machine Learning engineer's efforts, reinforcing to the adage that 80% of the work pertains to data processing, leaving only 20% for modeling. Yet, this preparation phase is crucial, setting the stage for the subsequent modeling efforts.

The impact of a dataset's quality and size on the outcomes of a model is profound, frequently surpassing the effects of model adjustments. This impact is embedded into the "Garbage in, garbage out" concept related to data science projects.

When is the dataset used?

Datasets are integral throughout the AI/ML project lifecycle, playing pivotal roles in various stages:

  • Exploration: This phase involves delving into the dataset to find insights, study variable relationships, and discern patterns that could influence future predictions.
  • Data Processing: At this stage, the focus is on crafting features that encapsulate the predictive essence of the data and on partitioning the dataset effectively to gear up for the modeling phase.
  • Model Tuning: Here, the objective is to refine the model's hyperparameters through strategies like cross-validation to bolster the model's generalization capability.
  • Model Evaluation: This final step entails evaluating the model's performance on unseen data and identifying areas for potential improvement.

What are the types of datasets?

Datasets are typically classified into three categories: structured, unstructured, and semi-structured, each with distinctive features:

Structured Data

This category includes data that adhere to a clear schema, simplifying organization and analysis.

  • Tabular Data: Characterized by its rows-and-columns format, where each column holds data of a uniform type. This format is prevalent in CSV files and databases.
  • Time Series Data: Comprises sequential data points collected at consistent time intervals. The sequential order is crucial, as alterations can significantly impact the dataset's interpretation, making it vital for forecasting in sectors like finance and energy.
  • Geospatial Data: Relates to specific geographical locations or areas, critical for spatial analyses and often managed with Geographic Information Systems (GIS).
  • Graph Data: Consists of nodes (or vertices) and edges (or connections), representing entities and their interrelations. This data type is crucial for modeling complex networks and systems.

Unstructured Data

Unstructured data lacks a predefined format, which poses challenges in its processing and analysis.

  • Text: Ranges from short messages to extensive documents, marked by the complexity and diversity of language.
  • Multimedia: Includes images, audio, and video files, known for their high dimensionality and the complexities involved in extracting meaningful insights.

Semi-Structured Data

Positioned between structured and unstructured data, semi-structured data does not follow a strict schema but includes markers or tags to aid in identifying specific elements, making it more manageable. JSON and XML are notable examples.

Which dataset should you use?

The decision on which dataset to use often boils down to a balance between familiarity and exploration of new data. A simple rule of thumb is to opt for the dataset you are most acquainted with. Regardless of the diversity in data types and applications, the core principles of MLOps are applicable across various domains. Therefore, starting with a well-understood dataset allows you to concentrate on honing your MLOps skills rather than untangling the complexities of an unfamiliar dataset.

As mentioned in the previous section, the course uses the Bike Sharing Demand dataset dataset by default. You are free to use any other datasets, either for personal or professional purposes.