Inspiration

  • Our solution was inspired by the recognition that the Journey North database and other community volunteer databases rely solely on data points generated from the personal observations and geographical locations of the individuals who submit them. Participants in these research studies are not randomly selected; they choose to take part voluntarily. This self-selection can produce a non-representative sample, skewing the data on monarch butterfly migration patterns. Beyond self-selection bias, these volunteer databases are also subject to geographical and motivational biases: certain areas may have more active observers than others, and motivational factors, such as individual interest or concern for monarch conservation, influence who chooses to participate, further contributing to potential biases in the data.
  • For instance, the U.S. map graphic shows higher concentrations of data points along the East and West Coasts. This disparity is primarily due to the larger populations in these regions compared to the Midwest. Because the Journey North campaign has fewer active users in the Midwest, monarch migration through these areas is under-represented, and the data collected does not accurately reflect true migration patterns.

What it does

Python Script

  • This code automates the mapping of towns and states (or provinces) to their respective counties using the OpenAI API. It begins by importing the necessary libraries (pandas for data manipulation and the openai package for calling OpenAI's language model) and initializing an OpenAI client, assuming the API key is configured in the environment. The script reads a CSV file containing columns named 'Town' and 'State/Province' into a pandas DataFrame.
  • A new column named 'County' is added to the DataFrame, initialized with None values to store the county names. The script iterates through each row of the DataFrame, extracting the 'Town' and 'State/Province' values and creating a prompt that instructs the OpenAI model to return only the name of the county corresponding to the specified town and state. This prompt is sent to the OpenAI API using the client.chat.completions.create() method, where the model (specified as "gpt-4o") responds with the appropriate county name. The county name is then extracted, cleaned using the strip() method to remove unnecessary whitespace, and stored back in the 'County' column of the DataFrame.
  • The script prints progress for each town-state pair processed, showing the corresponding county name. After completing all iterations, the updated DataFrame, which now includes the county names, is saved to a new CSV file without the index column. Finally, a message is displayed to confirm that the county names have been successfully added and saved to the specified file. This approach leverages the language model's understanding of geographic context to provide accurate county mappings, making it useful for automating tasks where no direct lookup table or API is available for this information.
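The workflow described above can be sketched roughly as follows. This is a minimal illustration, not our exact script: the prompt wording, file paths, and function names are assumptions, and it presumes the `openai` and `pandas` packages are installed with `OPENAI_API_KEY` set in the environment.

```python
def build_prompt(town: str, state: str) -> str:
    """Prompt instructing the model to reply with the county name only."""
    return (f"Here is the town and state: {town}, {state}. "
            "Respond only with the county.")

def add_counties(input_csv: str, output_csv: str, model: str = "gpt-4o") -> None:
    """Read Town/State pairs from a CSV, look up each county via the
    OpenAI API, and save the result with a new 'County' column."""
    import pandas as pd        # assumes pandas is installed
    from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

    client = OpenAI()
    df = pd.read_csv(input_csv)
    df["County"] = None  # new column to hold the results
    for i, row in df.iterrows():
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": build_prompt(row["Town"], row["State/Province"])}],
        )
        county = response.choices[0].message.content.strip()
        df.at[i, "County"] = county
        print(f"{row['Town']}, {row['State/Province']} -> {county}")  # progress
    df.to_csv(output_csv, index=False)
    print(f"County names added and saved to {output_csv}")
```

Calling `add_counties("towns.csv", "towns_with_counties.csv")` would process each row in turn and print the progress line for spot-checking.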

How we built it

Python script made by Jo Cordova-Flores

  • The Python script was developed iteratively through trial and error. Initially, the approach involved importing the geopy library to infer county information for each town and state. However, this method resulted in multiple errors throughout various parts of the code. To address these issues, I utilized CommandLineGPT to help identify and resolve the errors.
  • Upon running the code step-by-step, I discovered that the geolocator component frequently timed out. To validate this, I created a smaller .csv file to test geolocation lookups for five different towns and states. As expected, the timeouts persisted. Recognizing this limitation, I decided to explore OpenAI's API for a simpler, more "human-like" response. After identifying a suitable skeleton for the final code, I modified it to query my hometown and successfully retrieved the correct county.
  • Encouraged by this success, I re-evaluated my approach and implemented a loop to automate the process. This loop was designed to repeatedly send the prompt, "Here is the town and state. Respond only with the county," for each location. I then used CommandLineGPT to refine the code further, enabling it to read town and state information from a .csv file in my folder. The code would query the county for each location and write the results into a new .csv file with an additional column containing the county names. Additionally, it would print out the current progress, indicating which town/state corresponded to which county, allowing us to spot-check the results by examining the .csv file.
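To run several copies of the lookup script side by side, the input CSV must first be divided into roughly equal parts. A minimal sketch of that split is below; the output file names and part count are illustrative, not our exact setup.

```python
import csv

def split_csv(input_csv: str, n_parts: int = 3) -> list:
    """Split a CSV into n_parts files (part_0.csv, part_1.csv, ...) so that
    separate copies of the county-lookup script can each process one part."""
    with open(input_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    base, extra = divmod(len(rows), n_parts)
    out_files, start = [], 0
    for i in range(n_parts):
        # Spread any remainder rows across the first few parts.
        end = start + base + (1 if i < extra else 0)
        name = f"part_{i}.csv"
        with open(name, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)  # repeat the header in every part
            writer.writerows(rows[start:end])
        out_files.append(name)
        start = end
    return out_files
```

Each part file keeps the original header, so an unmodified copy of the lookup script can consume it directly; the per-part outputs are concatenated afterward.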
  • Once the code was completed, it was transferred to Tyler Sheindlin's machine, where he successfully imported the validated arrays into the system; these are referenced and demonstrated in the following sections. Three instances of the Python script were executed in parallel to enhance processing speed and efficiency, significantly reducing the overall computation time required for data analysis from ~40 minutes to ~15 minutes.

Exporting HTML into a CSV, followed by Excel

  • To analyze the migration data of monarch butterflies, we initiated the process by opening the Journey North Monarch Adult Sightings map. Using the "View Data" option, we displayed a table containing all data points specific to the U.S.
  • We accessed the webpage's underlying structure through the browser's developer console. By examining the "Elements" panel, we verified that the chart was rendered as a header plus an HTML table, confirming that the data could be extracted directly from the HTML.
  • A JavaScript snippet was then generated using ChatGPT, enabling us to extract the data points from the HTML table and export them into Microsoft Excel for further analysis.

CODAP graph made by Benjamin Fairbank

  • Initially, we converted the data table from the Journey North Monarch Adult Sightings into Excel format. We then filtered the dataset to exclude all states and provinces outside the U.S. and formatted it so that each row's county could be looked up from its town and state.
  • Following this, we converted the Excel file into CSV format. Once the data was in CSV format, we imported it into CODAP (Common Online Data Analysis Platform).
  • In CODAP, we repositioned the "County" column to the far left of the table. We then added a new column labeled "Boundaries," placing it immediately to the right of the "County" column. To establish the boundaries, we selected the "Edit Formula" option, followed by "Insert Function," and chose the "Lookup Function," specifically the lookupBoundary(boundary_set, boundary_key) function.
  • After the function was inserted, we modified the "boundary_set" to reference "US_county_boundaries" and updated the "boundary_key" to correspond with the "County" column.
  • After inserting the function, we set the "Boundaries" column's attribute type to "boundary," and likewise set the "Number" column's attribute type to "numeric."
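Putting the steps above together, the formula entered for the "Boundaries" column takes the form:

```
lookupBoundary(US_county_boundaries, County)
```

Here `US_county_boundaries` is the built-in boundary set and `County` is the column holding the county name for each row.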

CSV/Excel file data manipulation by Tyler Sheindlin

  • To prepare data for processing and analysis with GPT, we worked with CSV and Excel files, first scrubbing outliers to ensure clean and accurate datasets. Outliers were identified by manually inspecting values against thresholds relevant to the study. Depending on data integrity and relevance, these outliers were either corrected or removed.
  • Afterward, the data was organized into three separate groups, which allowed for efficient sorting and aggregation of results. Finally, the cleaned and organized datasets were split into smaller, manageable groups to facilitate parallel processing. This approach enabled more efficient data handling and reduced processing times, leveraging GPT for simultaneous computation across multiple subsets of the data.
  • After the GPT processing, the files were recombined into the main dataset, which was then filtered for our statistical analysis using Excel functions (AVERAGEIFS() and SUMIFS()).
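As a rough illustration of what those worksheet functions compute, an equivalent aggregation can be written in pandas. This is a sketch only; the column names (`State`, `County`, `AQI`, `TempF`, `Sightings`) are hypothetical stand-ins for the actual workbook columns.

```python
import pandas as pd

def county_summaries(df: pd.DataFrame) -> pd.DataFrame:
    """Per-county averages and totals, mirroring the kind of conditional
    aggregation AVERAGEIFS() and SUMIFS() perform in the Excel workbook."""
    return (df.groupby(["State", "County"], as_index=False)
              .agg(avg_aqi=("AQI", "mean"),
                   avg_temp=("TempF", "mean"),
                   sightings=("Sightings", "sum")))
```

Each output row corresponds to one state/county pair, just as each AVERAGEIFS() cell in the spreadsheet is keyed to one county.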

Statistical Analysis/Correlation Study by Jonathan Calzadillas

Data Collection and Preparation

  • The first phase of the project involved gathering temperature data (in Fahrenheit) and air quality data (AQI), with AQI categorized as follows: 0–49 (Good), 50–99 (Moderate), and 100 and above (Poor). This data was collected per county across the United States, along with adult monarch butterfly population density, covering 2022 and 2023. Once collected, the data went through an initial cleaning pass that retained only the key parameters: date, county, state, air quality, and temperature. The dataset was then narrowed to temperature and AQI readings from August through October, and average AQI and temperature values were calculated for each selected county across different states for this period, forming the final synthesized dataset.
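The cleaning steps above can be sketched in pandas as follows. This is an illustration under assumed column names (`Date`, `State`, `County`, `AQI`, `TempF`), not the exact preparation code.

```python
import pandas as pd

def fall_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only August-October records, then compute the average AQI and
    temperature per county/state, forming the synthesized dataset."""
    months = pd.to_datetime(df["Date"]).dt.month
    fall = df[months.isin([8, 9, 10])]  # narrow to the study window
    return (fall.groupby(["State", "County"], as_index=False)[["AQI", "TempF"]]
                .mean())
```

Records outside August–October are dropped before averaging, so a county's summary reflects only the migration-season window.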

Analysis

  • With the cleaned and synthesized dataset, scatterplots were created to visualize the relationships between temperature and monarch butterfly population density, and between air quality (AQI) and monarch density, for both 2022 and 2023. Linear regression lines were applied to these scatterplots to further assess potential correlations, and R² values were calculated.
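The R² statistic reported alongside those regression lines can be computed from a least-squares fit. The sketch below shows the calculation in general; it is not the plotting code itself.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Fit y = slope*x + intercept by least squares and return
    (slope, intercept, R^2) for a scatterplot's regression line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)        # total sum of squares
    return slope, intercept, 1.0 - ss_res / ss_tot
```

An R² near 1 indicates the line explains most of the variation; the low values we observed mean the line explains very little.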

Results

  • The analysis revealed a minimal linear correlation between both temperature and air quality and monarch butterfly population density. This was indicated by the low R² values across both years, suggesting that neither temperature nor air quality is a strong predictor of monarch butterfly density in the observed data.

Challenges we ran into

  • We encountered numerous errors during the challenge, ranging from technical issues to personal challenges.
  • Personal challenges often reflected our varying strengths and weaknesses. Although our skills improved over time, delegating tasks to team members proved to be a challenge. Some team members possessed advanced knowledge of computers, data analysis, and programming, allowing them to handle more complex tasks. Conversely, others had limitations in their technical skills and found themselves assigned to more tedious tasks that, while less demanding, still contributed to overall progress.
  • Data scrubbing and file management presented significant hurdles throughout this process. Managing data within the team became a struggle, starting with a hard drive filled with numerous files; renaming files for proper specificity and making copies at every opportunity turned into a chaotic task that complicated our workflow.
  • Throughout the project, we consistently worked to troubleshoot programs such as Excel/CSV, Python, and CODAP, grappling with error codes, broken data, and specific formula requirements. This involved identifying and addressing technical issues as they arose, collaborating closely as a team to share insights and strategies, and reaching out to mentors and others for help. For instance, in Excel, we spent considerable time diagnosing formula errors that impacted our data analysis, requiring us to meticulously check each cell for inconsistencies. In Python, we encountered numerous syntax errors and bugs in our scripts that hindered our ability to process data effectively, prompting us to seek solutions through online forums and documentation. Similarly, while working with CODAP, we faced challenges in importing and visualizing our datasets, which necessitated a deep dive into the platform's functionalities. This collective effort not only improved our technical skills but also fostered a sense of teamwork and resilience as we navigated the complexities of each program.

Accomplishments that we're proud of

  • We are proud of staying resilient and not giving up during this challenge. Whether it was troubleshooting the Python script for six hours, mapping towns and states to counties for four hours, or asking numerous clarification questions, we maintained high morale, made significant progress, and ultimately reached our end goal. We take pride in our ability to actively engage in workshops while continuously advancing our efforts in the challenge. Although this Datathon presented its challenges, we recognized the importance of allowing our team members to attend available workshops to further enhance their knowledge and skill set in data analysis. This commitment to learning fosters individual growth and strengthens our team's overall capabilities, enabling us to tackle complex problems with greater confidence and creativity in our future careers.

What we learned

  • We learned to use MySQL and Python to analyze and synthesize data, while also mastering Excel to effectively manipulate large datasets.
  • We learned the importance of patience when addressing persistent challenges in an unfamiliar programming language. This experience highlighted the need to consider different perspectives while writing code, especially in handling exceptions and ensuring that the script runs smoothly without unexpected stops or failures. Additionally, we developed a deeper understanding of how to query Large Language Models (LLMs) effectively; even slight changes in the wording of questions can significantly impact the responses received. This experience deepened our appreciation for data science professionals who regularly interact with LLMs. Evidently, the learning process is ongoing, and everyone is striving to optimize these tools.
  • We learned how to convert between Excel and CSV files.
  • We learned how to use CODAP to create a chart with numerical values and then, based on this chart, developed a formula for specific columns that allowed all the data from the table to be plotted on the U.S. map.
  • We acquired the skills to utilize AI, particularly ChatGPT, effectively and accurately to generate formulas and solutions that accelerated our progress in completing the challenge.
  • We gained skills in troubleshooting problems related to formulas, tables, graphics, and other resources.
  • We learned to leverage each other's strengths and weaknesses to distribute the workload more effectively, ensuring a balanced division of labor and timely completion.

What's next for Monarch Butterflies: Monitoring Migration & Population

  • Through extensive research of peer-reviewed literature and resources from government, state, and global organizations, we developed a plan to utilize satellite imaging technologies from agencies such as NASA, the U.S. Space Force, and the U.S. Geological Survey (USGS). This approach would monitor monarch migration patterns and support the conservation of their habitats. These technologies provide detailed, real-time data on land use, vegetation changes, and environmental conditions, which are essential for assessing the availability of critical habitats like milkweed fields and overwintering sites. High-resolution imagery from NASA's Landsat satellites and USGS Earth observation tools can track deforestation, agricultural expansion, and the impacts of climate change on breeding and migratory areas.
  • To implement this strategy, we propose developing a collaborative plan with these agencies to establish a monitoring system that focuses on key regions, such as the monarch's migration corridors in the U.S. and overwintering forests in Mexico and California. This partnership would enable conservationists to identify habitat loss early and guide restoration efforts, including targeted milkweed planting and the protection of overwintering sites.
  • An interagency task force should be formed to execute this partnership, consisting of representatives from NASA, the U.S. Space Force, USGS, the U.S. Fish and Wildlife Service (USFWS), and relevant conservation groups and communities. This task force would define objectives, such as identifying and monitoring critical monarch habitats, and establish the technological requirements for satellite imaging, data collection, and analysis.
  • The next step involves designing a centralized data-sharing platform managed by the USFWS in collaboration with conservation experts. This platform would integrate satellite imagery with existing environmental data on monarch populations and habitats. NASA and USGS would supply high-resolution imagery and environmental data, while the Space Force could enhance surveillance capabilities in key regions, particularly during monarch migration seasons. The processed data would help detect habitat changes, including deforestation, land conversion, and extreme weather events that threaten monarch breeding and migratory sites.
  • Once the infrastructure is established, field teams from conservation organizations and local governments can use satellite data to prioritize intervention areas, such as replanting milkweed or protecting overwintering forests. Rapid responses to emerging threats will enable targeted and efficient on-the-ground efforts. Regular interagency meetings would ensure coordination and adaptive management based on the latest data. Public engagement could be fostered through open-access data and collaboration with citizen scientists monitoring local monarch populations.
  • Finally, securing long-term funding and resources is essential and achievable through applications for government conservation grants and private-sector support. Regular evaluations would assess the effectiveness of the satellite-based monitoring system, allowing for adjustments as technology advances. This collaboration would create a dynamic, data-driven approach to saving the monarch population by leveraging multiple federal agencies' combined resources and expertise.

Built With
