Volunteer Sofia Vakhutinsky shares her summer with metadata
Sofia Vakhutinsky recently graduated from the University of Colorado Boulder with a BA in Geography and Economics and minors in Geology and Atmospheric and Ocean Science. In the fall, Sofia will start a master’s program in Atmospheric Science at the University of Washington. Sofia spent the summer sifting through metadata, which will be integrated into the MGDS Vehicle Dive Metadata archive.
Below is Sofia’s account of what she worked on this summer for NDSF:
This summer I was tasked with cleaning up and standardizing Alvin’s data so the vehicle’s metadata product could be made available to the public at the same quality and standard as is currently available from Sentry. My first objective was to generate all of the files I would need to source metadata from; with those in hand, I could run a Python script to extract the needed parameters and populate a metadata summary for each cruise. These summaries were then added to a shared master Alvin metadata spreadsheet intended to capture the full history of Alvin dives. Throughout the project, it was also imperative to document every data issue encountered, so as to provide context for how the metadata was generated.
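As an illustration only, the sketch below shows the general shape of that workflow under some simplifying assumptions: the per-cruise source files are assumed to be CSVs laid out as dive_*.csv, the field names and the cruise directory AT50-99 are hypothetical, and the shared master spreadsheet is stood in for by a plain CSV file rather than the actual NDSF/MGDS product.

```python
import csv
from pathlib import Path

# Hypothetical field names; the real metadata product defines its own schema.
FIELDS = ["cruise_id", "dive_id", "launch_time", "recovery_time",
          "launch_lat", "launch_lon", "recovery_lat", "recovery_lon"]

def summarize_cruise(cruise_dir: Path) -> list[dict]:
    """Build one metadata row per dive from the files in a cruise directory."""
    rows = []
    for dive_file in sorted(cruise_dir.glob("dive_*.csv")):  # assumed file layout
        with dive_file.open(newline="") as f:
            record = next(csv.DictReader(f))                  # one summary row per dive
        rows.append({field: record.get(field, "") for field in FIELDS})
    return rows

def append_to_master(rows: list[dict], master_path: Path) -> None:
    """Append the per-cruise rows to the shared master spreadsheet (a CSV here)."""
    new_file = not master_path.exists()
    with master_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    rows = summarize_cruise(Path("cruises/AT50-99"))          # hypothetical cruise ID
    append_to_master(rows, Path("alvin_master_metadata.csv"))
```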
One of my first tasks was generating files containing the ship’s navigational data for launch and recovery coordinates by parsing log files with Matlab scripts. This process, covering dives from 2024 back to mid-2016, took several weeks to complete because the scripts had to be troubleshot and updated to handle a variety of issues. Next, I ran a Python script on each cruise to generate metadata summaries and integrate them into the master Alvin spreadsheet. This step revealed more issues with missing files and changes in file naming conventions and structures, which led to further updates to the Python script.

As we continued investigating older cruises, some file types became unavailable, prompting us to treat Alvin’s data as two eras, 2024 to mid-2016 and mid-2016 to 2014, distinguished by the file types produced by the instruments in use at the time. I created a separate Python script to handle the earlier era and added functionality to parse legacy dive summary spreadsheets to fill data gaps. Although this older era of data was less consistent and required more troubleshooting than the newer cruises, we were able to generate the metadata summaries and add them to the master spreadsheet, providing 10 years of standardized Alvin metadata to be made available for public research.
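A minimal sketch of that era handling might look like the following. The mid-2016 cutover date, the per-dive file name dive_summary.csv, and the dive_id column in the legacy dive summary spreadsheet (exported to CSV here) are all assumptions for illustration, not the actual files or scripts used.

```python
import csv
from datetime import date
from pathlib import Path

# Assumed cutover between the two eras; the actual boundary sits somewhere in mid-2016.
ERA_BOUNDARY = date(2016, 7, 1)

def parse_modern_dive(dive_dir: Path) -> dict:
    """Newer era: read the per-dive summary file produced by the later instruments.
    The file name and its columns are hypothetical."""
    with (dive_dir / "dive_summary.csv").open(newline="") as f:
        return next(csv.DictReader(f))

def parse_legacy_dive(dive_dir: Path, legacy_sheet: Path) -> dict:
    """Older era: some expected files no longer exist, so fall back to the
    legacy dive summary spreadsheet to fill the gaps."""
    dive_id = dive_dir.name
    with legacy_sheet.open(newline="") as f:
        for row in csv.DictReader(f):
            if row.get("dive_id") == dive_id:   # assumed column name
                return row
    return {"dive_id": dive_id}                 # no record found; leave fields blank

def summarize_dive(dive_dir: Path, dive_date: date, legacy_sheet: Path) -> dict:
    """Dispatch to the parser for whichever era the dive belongs to."""
    if dive_date >= ERA_BOUNDARY:
        return parse_modern_dive(dive_dir)
    return parse_legacy_dive(dive_dir, legacy_sheet)
```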