Decline in AI Training Data Availability

Context:

According to a study by the Data Provenance Initiative, over the past year, many of the most important web sources used for training AI models have restricted the use of their data.

 

More on the News:

  • Robots Exclusion Protocol: Restrictions are implemented through robots.txt files, limiting automated data collection.
  • C4 Dataset: Up to 45% of the data in the C4 dataset is now restricted by websites' terms of service.
  • Publishers and platforms like Reddit and StackOverflow are blocking data access or charging fees. 
    • Legal actions have been taken by some, such as The New York Times.
  • Data Deals: Some AI companies have secured agreements with major publishers (e.g., AP, News Corp) for continued data access.
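To illustrate how robots.txt restrictions work in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the crawler names (`ExampleAIBot`, `OtherBot`), and the URLs are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely and
# keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each pair is (crawler name, URL it wants to fetch); all are hypothetical.
checks = [
    ("ExampleAIBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/private/report"),
]
for agent, url in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent} -> {url}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is a voluntary convention: it tells well-behaved crawlers what the site owner permits, but it does not technically prevent access.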

 

Data Crisis:

  • Decline in consent for data usage affects researchers, academics, and non-commercial entities.
  • About 5% of all data, and 25% of data from the highest-quality sources, in the datasets studied has been restricted from use in AI training.
  • The surge in generative AI has caused conflicts with data owners.
  • Publishers have implemented paywalls and altered terms of service to restrict data access.
  • Companies like OpenAI, Anthropic, and Google face restrictions as some website owners block their web crawlers.
  • Smaller AI companies and academic researchers dependent on public data sets are experiencing difficulties.

 

Role of Data in AI:

  • AI systems depend on large volumes of data to train machine learning algorithms.
  • High-quality data (diverse, comprehensive, and relevant) enhances the accuracy of AI models.
  • AI systems have long relied on vast amounts of online data (text, images, and videos) for training. This data is now becoming less accessible.
  • Types of Data:
    • Training Data: Used to train the AI model.
    • Validation Data: Used during development to tune the model and check its accuracy.
    • Test Data: Used to evaluate the final model's performance and compare it with other models.
  • Sources of Data:
    • Internal Data: Data held within an organisation, such as customer data, used for specific AI applications like Spotify's playlist recommendations.
    • External Data: Acquired from third-party vendors, open data sets (e.g., government, research institutions), or web scraping. A recent trend is platforms such as Reddit charging for API access.
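The three types of data listed above usually come from splitting a single dataset. A minimal sketch in plain Python follows; the function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are illustrative assumptions, not something prescribed by the article.

```python
import random


def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records and split them into training, validation, and test sets.

    The fractions are illustrative defaults; whatever is left after the
    training and validation portions becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical dataset of 100 numbered records.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Keeping the test set separate from the data used for training and tuning is what makes the final evaluation a fair comparison between models.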

 

Impact on AI Development & Future Concerns:

  • Model Performance: High-quality data is crucial for AI models like ChatGPT, Gemini, and Claude. Restrictions are affecting their ability to improve. 
  • Access Challenges: AI companies, particularly smaller ones and researchers, face difficulties due to restricted data access and the need for licensing deals.
  • Data Wall: A potential “data wall” could emerge where accessible public data is exhausted and remaining data is behind paywalls or exclusive deals.
  • Synthetic Data: AI companies consider using synthetic data, but its quality as a replacement for real data is debated.

 

Call for New Solutions:

  • Control Tools: There’s a need for improved tools allowing website owners to manage how their data is used, distinguishing between commercial and non-commercial uses.
  • Lessons for AI Companies: The data access restrictions highlight the importance of valuing data sources and addressing consent issues to prevent data access from being further restricted.

 

NITI Aayog launched the National Data & Analytics Platform (NDAP) for open public use.

  • Objective: Democratise access to public government data by making it accessible, interoperable, interactive, and user-friendly.
  • Features: Hosts foundational datasets from various government agencies, presents them coherently, and provides tools for analytics and visualisation.

 
