Decline in AI Training Data Availability
Context:
According to a study by the Data Provenance Initiative, over the past year, many of the most important web sources used for training AI models have restricted the use of their data.
More on the News:
- Robots Exclusion Protocol: Restrictions are implemented through robots.txt files, limiting automated data collection.
- C4 Dataset: Up to 45% of the data in the C4 dataset is restricted by websites’ terms of service.
- Publishers and platforms such as Reddit and Stack Overflow are blocking data access or charging fees for it.
- Some publishers, such as The New York Times, have taken legal action.
- Data Deals: Some AI companies have secured agreements with major publishers (e.g., AP, News Corp) for continued data access.
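The robots.txt mechanism mentioned above can be illustrated with a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; real sites publish their own rules, and compliance with them is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely and
# keeps all other crawlers out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for path in ("/articles/1", "/private/report"):
            url = "https://example.com" + path
            print(agent, path, can_fetch(ROBOTS_TXT, agent, url))
```

A well-behaved crawler would run a check like this before each request and skip any URL for which it returns False.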
Role of Data in AI:
- AI systems depend on large volumes of data to train machine learning algorithms.
- High-quality data (diverse, comprehensive, and relevant) enhances the accuracy of AI models.
- AI systems have long relied on vast amounts of text, images, and videos from the internet for training. This data is now becoming less accessible.
- Types of Data:
- Training Data: Used to train the AI model.
- Test Data: Used to evaluate the finished model’s performance and compare it with other models.
- Validation Data: Used during training to check the model’s accuracy and tune its settings.
- Sources of Data:
- Internal Data: Data held within an organisation, such as customer data, used for specific AI applications like Spotify’s playlist recommendations.
- External Data: Acquired from third-party vendors, open data sets (e.g., government, research institutions), or web scraping. A recent trend is platforms such as Reddit charging for API access.
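The training, validation, and test roles listed above can be sketched as a simple split of one dataset into three non-overlapping parts. This is a minimal illustration in plain Python, assuming an 80/10/10 split and a fixed random seed. The function name `split_dataset` and the fractions are hypothetical choices, not something prescribed by the study.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and split them into (train, validation, test) lists."""
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for test data")

    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, validation, test


if __name__ == "__main__":
    train, validation, test = split_dataset(range(100))
    print(len(train), len(validation), len(test))
```

Keeping the three parts disjoint matters: a model evaluated on examples it was trained on would appear more accurate than it really is.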
Impact on AI Development & Future Concerns:
- Model Performance: High-quality data is crucial for AI models like ChatGPT, Gemini, and Claude. Restrictions are affecting their ability to improve.
- Access Challenges: AI companies, particularly smaller ones and researchers, face difficulties due to restricted data access and the need for licensing deals.
- Data Wall: A potential “data wall” could emerge where accessible public data is exhausted and remaining data is behind paywalls or exclusive deals.
- Synthetic Data: AI companies are considering synthetic data, but whether it can adequately replace real data is debated.
Call for New Solutions:
- Control Tools: There’s a need for improved tools allowing website owners to manage how their data is used, distinguishing between commercial and non-commercial uses.
- Lessons for AI Companies: The data access restrictions highlight the importance of valuing data sources and addressing consent issues to prevent data access from being further restricted.