Documenting data produced using ChatGPT in R Packages

Some thoughts on including sample data generated by ChatGPT as data in R packages
generative-ai
r-packages
Author: Cynthia Huang
Published: April 13, 2024
Modified: November 9, 2024

When writing functions or packages, you often need sample data, either to test your functions or for users to play with. Before ChatGPT, I always wrote my sample data by hand or hacked together a function to generate it.

Today, I used ChatGPT to generate sample events for a ggplot2 calendar extension I’ve been building for fun. It was surprisingly easy and effective to prompt the model to produce entries like these:

data-raw/demo-events-gpt.yml
- event_id: EVT001
  startDate: "2024-05-15"
  endDate: "2024-05-16"
  event_title: "TechFest 2024"
  event_descr: "TechFest 2024 brings together innovators, startups, and tech enthusiasts for two days of keynote speeches, panel discussions, and workshops covering the latest trends in technology, artificial intelligence, and blockchain."
  event_emoji: "🤖"
  event_link: "https://techfest2024.example.com"

- event_id: EVT002
  startDate: "2024-05-25"
  endDate: "2024-05-27"
  event_title: "Global Health Summit"
  event_descr: "The Global Health Summit convenes healthcare professionals, policymakers, and researchers from around the world to discuss pressing issues in public health, disease prevention, and healthcare accessibility."
  event_emoji: "🌍"
  event_link: "https://globalhealthsummit.example.com"

That’s all well and good, but I wanted to include this data in my ggplot2 extension package, and it’s generally good practice to document your data and where it comes from. Of course, this is not a strict requirement, especially for demo datasets, but it got me thinking about when we should be “citing” generative AI. The R Packages book provides pretty clear guidance on documenting data in R packages.

Here’s what I came up with for adapting that advice to cite ChatGPT as a data source:

R/data.R
#' 5 Sample Events generated by ChatGPT 3.5
#'
#' A set of 5 demo events generated by ChatGPT 3.5 on 13 Apr, 2024.
#' The following field descriptions were also generated
#' in the same session.
#'
#' @format
#' A data frame with 5 rows and 7 columns:
#' \describe{
#'   \item{event_id}{Unique identifier for the event.}
#'   \item{startDate}{Start date of the event.}
#'   \item{endDate}{End date of the event.}
#'   \item{event_title}{Title or name of the event.}
#'   \item{event_descr}{Description of the event.}
#'   \item{event_emoji}{Emoji representing the event theme or type.}
#'   \item{event_link}{URL link to the event's website or page.}
#' }
#'
#' @source https://chat.openai.com/share/c68b7a82-5378-45c8-bf41-7fa134f0b74a
"demo_events_gpt"
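
For context, the YAML file shown earlier still has to become the `demo_events_gpt` object documented here. The following is a minimal sketch of a data-raw script that could do this, not necessarily the script used for this package. It assumes the yaml package is installed. The helper name `events_to_df()` is hypothetical, and two events are inlined with shortened descriptions so the snippet runs on its own.

```r
# Sketch of a data-raw script; the helper name events_to_df() is hypothetical.
# Assumes the yaml package is installed.

# Two events inlined from data-raw/demo-events-gpt.yml, with the descriptions
# shortened, so the example runs without the file.
events_yaml <- '
- event_id: EVT001
  startDate: "2024-05-15"
  endDate: "2024-05-16"
  event_title: "TechFest 2024"
  event_descr: "TechFest 2024 brings together innovators, startups, and tech enthusiasts."
  event_emoji: "🤖"
  event_link: "https://techfest2024.example.com"

- event_id: EVT002
  startDate: "2024-05-25"
  endDate: "2024-05-27"
  event_title: "Global Health Summit"
  event_descr: "The Global Health Summit convenes healthcare professionals, policymakers, and researchers."
  event_emoji: "🌍"
  event_link: "https://globalhealthsummit.example.com"
'

# Parse the YAML text into a list with one named list per event.
# For the real file, yaml::read_yaml("data-raw/demo-events-gpt.yml")
# returns the same kind of structure.
events <- yaml::yaml.load(events_yaml)

# Turn each event into a one-row data frame, then stack the rows.
events_to_df <- function(events) {
  rows <- lapply(events, function(ev) {
    as.data.frame(ev, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

demo_events_gpt <- events_to_df(events)

# Inside a package project, this call would save the object to
# data/demo_events_gpt.rda (left commented out here):
# usethis::use_data(demo_events_gpt, overwrite = TRUE)
```

Because the dates are quoted in the YAML, they stay as character strings in this sketch. Converting them with `as.Date()` is a separate choice.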

Citation

BibTeX citation:
@online{huang2024,
  author = {Huang, Cynthia},
  title = {Documenting Data Produced Using {ChatGPT} in {R} {Packages}},
  date = {2024-04-13},
  url = {https://www.cynthiahqy.com/posts/documenting-gpt-generated-data/},
  langid = {en}
}
For attribution, please cite this work as:
Huang, Cynthia. 2024. “Documenting Data Produced Using ChatGPT in R Packages.” April 13, 2024. https://www.cynthiahqy.com/posts/documenting-gpt-generated-data/.