Often when writing functions or packages, you might need to create some sample data to test your functions or for users to play with. Before ChatGPT, I always wrote my sample data by hand, or hacked together some function to generate it.
Today, I used ChatGPT to generate sample events for a ggplot2 calendar extension I’ve been building for fun. It was surprisingly easy and effective to prompt the model to:
- write 5 YAML nodes with the structure shown below,
- mock up example links and a relevant emoji, and
- modify the events to fall within the same 3-month window, with durations varying between 1 and 5 days:
 
data-raw/demo-events-gpt.yml

```yaml
- event_id: EVT001
  startDate: "2024-05-15"
  endDate: "2024-05-16"
  event_title: "TechFest 2024"
  event_descr: "TechFest 2024 brings together innovators, startups, and tech enthusiasts for two days of keynote speeches, panel discussions, and workshops covering the latest trends in technology, artificial intelligence, and blockchain."
  event_emoji: "🤖"
  event_link: "https://techfest2024.example.com"
- event_id: EVT002
  startDate: "2024-05-25"
  endDate: "2024-05-27"
  event_title: "Global Health Summit"
  event_descr: "The Global Health Summit convenes healthcare professionals, policymakers, and researchers from around the world to discuss pressing issues in public health, disease prevention, and healthcare accessibility."
  event_emoji: "🌍"
  event_link: "https://globalhealthsummit.example.com"
```
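Since the prompt asked for specific date constraints, it can be worth checking that the generated events actually satisfy them. The following is a minimal base R sketch, not part of the original package: it retypes the two events shown above by hand, counts durations inclusively (so 15 to 16 May is two days, matching the "two days" in the TechFest description), and assumes the "3-month window" starts at the earliest start date.

```r
# Two of the generated events, retyped by hand for this sketch.
events <- data.frame(
  event_id  = c("EVT001", "EVT002"),
  startDate = as.Date(c("2024-05-15", "2024-05-25")),
  endDate   = as.Date(c("2024-05-16", "2024-05-27"))
)

# Duration in days, counting both the start and the end date.
events$duration_days <- as.integer(events$endDate - events$startDate) + 1L

# Constraint 1: every event lasts between 1 and 5 days.
durations_ok <- all(events$duration_days >= 1L & events$duration_days <= 5L)

# Constraint 2: every event ends within 3 months of the earliest start date
# (an assumed reading of "the same 3-month window").
window_start <- min(events$startDate)
window_end <- seq(window_start, by = "3 months", length.out = 2)[2]
window_ok <- all(events$endDate <= window_end)

# Stop with an error if either constraint is violated.
stopifnot(durations_ok, window_ok)
```

The same checks could be run on the full data frame once all five events are loaded.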
That’s all well and good, but I wanted to include this data in my ggplot2 extension package, and it’s generally good practice to document your data and where it comes from. Of course, this is not a strict requirement, especially for demo datasets, but it got me thinking about when we should be “citing” generative AI. The R Packages textbook provides pretty clear guidance on documenting data in R packages.
Here’s what I came up with for adapting that advice to cite ChatGPT as a data source:
- explicitly including `gpt` in the name of the data frame created from the above YAML file: `demo_events_gpt`
- including a link to the chat in the `data-raw/` script and the `@source` documentation tag
- explicitly mentioning ChatGPT 3.5 in the data documentation
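For the second point, a `data-raw/` script along the following lines could turn the YAML file into the packaged data frame while recording the chat link. This is only a sketch and not necessarily the script used in the package: the script file name and the helper `read_demo_events()` are hypothetical, and it assumes the yaml and usethis packages are installed.

```r
# data-raw/demo_events_gpt.R  (hypothetical file name)
# Data generated by ChatGPT 3.5; source chat:
# https://chat.openai.com/share/c68b7a82-5378-45c8-bf41-7fa134f0b74a

# Hypothetical helper: read a YAML list of events into a data frame,
# with one row per event and one column per field.
read_demo_events <- function(path) {
  events <- yaml::read_yaml(path)
  rows <- lapply(events, function(event) {
    as.data.frame(event, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Run from the package root to rebuild the dataset:
# demo_events_gpt <- read_demo_events("data-raw/demo-events-gpt.yml")
# usethis::use_data(demo_events_gpt, overwrite = TRUE)
```

The last two lines are left commented out because `usethis::use_data()` expects to be run inside a package project. Note that quoted dates in the YAML file are read as character strings, so convert them with `as.Date()` if date columns are needed.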
 
R/data.R

```r
#' 5 Sample Events generated by ChatGPT 3.5
#'
#' A set of 5 demo events generated by ChatGPT 3.5 on 13 Apr, 2024.
#' The following field descriptions were also generated
#' in the same session.
#'
#' @format
#' A data frame with 5 rows and 7 columns:
#' \describe{
#'   \item{event_id}{Unique identifier for the event.}
#'   \item{startDate}{Start date of the event.}
#'   \item{endDate}{End date of the event.}
#'   \item{event_title}{Title or name of the event.}
#'   \item{event_descr}{Description of the event.}
#'   \item{event_emoji}{Emoji representing the event theme or type.}
#'   \item{event_link}{URL link to the event's website or page.}
#' }
#'
#' @source https://chat.openai.com/share/c68b7a82-5378-45c8-bf41-7fa134f0b74a
"demo_events_gpt"
```
Citation
@online{huang2024,
  author = {Huang, Cynthia},
  title = {Documenting Data Produced Using {ChatGPT} in {R} {Packages}},
  date = {2024-04-13},
  url = {https://www.cynthiahqy.com/posts/documenting-gpt-generated-data/},
  langid = {en}
}