Category: Archives

KBOO’s Edit-a-Thon and Reunion: Sustainable archiving through outreach in a community radio station

KBOO Community Radio has nearly 50 years of radio broadcast audio representing news and perspectives less commonly heard in mainstream media. Its collection was built through individuals’ personal responsibility and action: older recordings exist because KBOO’s longtime news and public affairs director, or board operators during shows, recorded them off the air. When I work with these older materials, I run into questions my contemporaries can’t answer, yet the answers are needed to document the content, date it, and identify who created it. I wanted to get volunteers in a room so that we could find answers to these questions!

[Photo courtesy of Sylvia Podwika]

KBOO is a community radio station, which means volunteers run the station: long-time volunteer hosts produce shows, train new volunteers, and serve on station committees. However, there was no existing cadre of archives volunteers. I developed tools and workflows for digital preservation and recruited two MLIS students to help me get the project up and running, but there is still no guarantee that this work will continue. Archiving projects stay sustainable only when they are visible and understood well enough to maintain support in staffing, resources, and technical infrastructure for continued work. To increase the visibility and understanding of our current archives work, I combined archives outreach with volunteer engagement and organized KBOO’s Archives Database Edit-a-Thon and Reunion event. The goal was to have past KBOO hosts come back and share their knowledge of KBOO news and public affairs programming while volunteers who value documentation (librarians and archivists) helped get these details into our archives management database.

[Photo courtesy of Sylvia Podwika]

I planned two structured activities, and I also knew we wanted to ask specific attendees targeted questions about content they were familiar with. The first activity was a fill-in-the-blanks task: decipher the shorthand and handwritten notes on an audio cassette to fill in specific fields in the archive database, such as title, description, date, date type, contributor, and notes written on the casing. Volunteers could also photograph the casing to add to the database, or listen to the audio. The second activity was listening to mystery content, i.e. a speaker at an event with no speaker name, location, or date recorded. Some of these examples seem almost impossible, but they are useful for showing why certain metadata fields matter for search and discovery. And if someone recognizes a voice, the mystery can be solved; it takes the right combination of people with unique knowledge to identify unique items. Our database supports “proposed changes,” so people didn’t have to be highly trained in metadata requirements. I gave basic instructions, knowing that any proposed change could be sorted out in post-event editing. The information received, once reviewed, becomes part of the record of our station’s content.

Results? It worked! Success came from:

  • Clearly defining outcomes and expectations
  • Staying flexible for people to work on their own, at their own pace, or with help
  • KBOO being KBOO: there is already a spirit of volunteerism in this community
  • Having KBOO people from different decades available to provide insight: hosts from the 1960s, 1970s, 1980s, and 1990s through the current day
  • Having dedicated helpers
  • Donuts and coffee, pizza and seltzer water

Outcomes

  • People had fun!
  • Mysteries were solved!
  • There were reunion moments: hugs and photos
  • Everyone learned more about what KBOO is doing with its news and public affairs audio collection and contributed in a hands-on way
  • Individuals asked how they could get more involved with archives work

An archives event like this is replicable and valuable for other public broadcasting stations and archives. Each station will have to define its own outcomes and create a pathway to success for itself and its participants, based on how its archive is set up. At our Edit-a-Thon, volunteers responded to making tangible contributions to the archives and to seeing the real day-to-day work of archiving.

Lessons learned
Tech preparation was important. Our database lives on our network server, inaccessible from the internet, and our largest event space has no computers in it. We bought a wifi access point and had our IT contractor set up a temporary secure wireless network so that participants could work with the database for the day. Asking RSVPs whether they could bring laptops was important, and so was having extras on hand. Even so, we didn’t have enough computer spaces and had to start showing people how to do the work on computers in the hallways, which took away from the intended group feel of the event.

[Photo: we referenced old listener guides, too]

When promoting the event, it was difficult to describe an edit-a-thon, which isn’t surprising: the work I’m doing falls outside what most people think or know about archives. I had flyers, blurbs in newsletters, and on-air promos, but people still contacted me for clarification. Our short, engaging description of the event could have been better.

People didn’t just come. They were asked multiple times by email and in person, called directly, and may have felt an obligation to come. Recruiting the right people was the most important part of this event, and it took the most time. I worked with KBOO’s volunteer coordinator and only contacted people who had left on good terms. KBOO’s program director also suggested people who hold a great amount of knowledge, such as former station managers. I also targeted individuals whose content was in the database but missing important information. Although the event was appropriate for the general public, participants self-selected based on their existing interest in archives; I asked for help posting to Northwest and Portland-area archives calendars. And although I was only expecting library/archivist types and former KBOO staff and volunteers, a radio listener came in as well. We had 21 RSVPed participants for the six-hour event, which we broke into two shifts: 10am to 1pm and 1pm to 4pm. Three of us were dedicated helpers/planners. Some people stayed for more than one shift, and there were additional people who stopped by out of curiosity, with whom I was able to talk about archives and invite to buddy up with event participants doing work.

[Photo: What are CFH and MAK? Some mysteries solved]

We didn’t schedule breaks. Breaks would definitely be a requirement in future event plans, so facilitators could be reminded to take a moment; it took a lot of energy to be “on” for over six hours. In our debrief, we talked about the benefit of planned share-out sessions in future events, where the entire group would take a break together to share fun discoveries.

There were definitely one-time participants as well, but KBOO now has new archives volunteers (and new-old volunteers) eager to sustain archives work.

Archives work is not widely known, yet the sustainability of archiving projects requires visibility and understanding in order to maintain support in staffing, resources, and technical infrastructure for continued work.



Moving Beyond the Allegory of the Lone Digital Archivist (& my day of Windows scripting at KBOO)

The “lone arranger” was a term I learned in my library science degree program, and I accepted it. I visualized hard-working, problem-solving solo archivists in small-staff situations, challenged with organizing, preserving, and providing access to growing volumes of historically and culturally relevant materials for researchers. As much as the archives profession is about facilitating a deep relationship between researchers and records, this term described professionals, myself among them, working alone and within known limitations. This reality has encouraged archivists without a team to band together and get creative about forming networks of professional support. The Society of American Archivists (SAA) has organized support for lone arrangers since 1999 and now has a full-fledged Roundtable where these professionals meet and discuss their challenges. Similarly, support for the lone digital archivist was the topic of a presentation I heard at the recent 2017 Code4Lib conference at UCLA by Elvia Arroyo-Ramirez, Kelly Bolding, and Faith Charlton of Princeton University.

Managing the digital record is a challenge that requires more attention, knowledge sharing, and training in the profession. At Code4Lib, digital archivists talked about how archivists on their teams did not know how to process born-digital works, and how this was a challenge, and, more than that, unacceptable in this day and age. It was pointed out that our degree programs didn’t support digital archiving the way they supported processing archival manuscripts and other ephemera. The NDSR program aims to close the gap on digital archiving and preservation, and the SAA has a Digital Archives Specialist credential program, but technology training in libraries and archives shouldn’t be limited to the few who are motivated to seek it out. Many archivist jobs will be in small or medium-sized organizations, and we argued that processing born-digital works should always be considered part of archival responsibilities. Again, this was a conversation among proponents of digital archives work, and I recognize that it excludes many other thoughts and perspectives. The discussion would be more fruitful if it included individuals who feel blocked in learning to process born-digital records, and if it focused on how to break down those barriers.

Code4Lib sessions (http://bit.ly/d-team-values, http://scottwhyoung.com/talks/participatory-design-code4lib-2017/) reinforced values of the library and archives profession, namely advocacy and empowering users. No matter how specialized an archival process is, digital or not, there is always a need to talk about the work with people who know very little about archiving, whether they are stakeholders, potential funders, community members, or new team members. Advocacy is usually associated with external relations, but it is also an approach we can take when introducing colleagues to technology skills within our library and archives teams. Many sessions at Code4Lib were highly technical, yet the conversation always circled back to helping users and staying in touch with humanity.

By highly technical, I do not mean “scary.” Another session reminded us that technology can often cause anxiety, and can be misinterpreted as something that solves all problems. When we talk to people, we should let them know both what technology can do and what it can’t. The reality is that technology knowledge is attainable and shouldn’t be feared; it cannot solve every work challenge, but a new skill set and an understanding of technology can help us reach some solutions. The process is holistic, too: the framing of a challenge is a human-defined model, and finding ways to meet the challenge will also be human driven. People will always brainstorm their best solutions with the tools and knowledge available to them, so let’s add digital archiving and preservation tools and knowledge to the mix.

And the Windows scripting part?

I was originally going to write only about my checksum validation process on Windows, without Python, but then I went to Code4Lib, which was inspiring and thought-provoking. In the distributed cohort model, I am a lone archivist if you frame the view around my host organization. But I primarily draw my knowledge from my awesome cohort members and from the growing professional network I connected with on Twitter (Who knew? Not me.). In that expanded view, I am not a lone archivist. When I was challenged to validate a large number of checksums without the ability to install new programs on my work computer, I asked my colleagues for help. Below is my abridged process, showing how I worked through an unfamiliar problem to a workable solution using not only my own ideas, but ideas from my colleagues. Or scroll all the way down for “Just the solution.”

KBOO recently received files back from a vendor who digitized some of our open-reel content. Hooray! Like any good post-digitization work, ours had to start with verification of the files, and that meant validating checksum hash values. Follow me on my journey through my day of PowerShell and the Windows command line.

Our deliverables included a preservation wav, a mezzanine wav, and an mp3 access file, plus related jpgs of the items, an xml file, and an md5 sidecar for each audio file. The audio filenames followed our file naming convention, designated in advance, and all files related to a physical item sat in a folder named by the same convention.
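
For a single physical item, the delivered folder looked something like this (identifier invented for illustration):

KB_19790502_news/
    KB_19790502_news.wav            (preservation wav)
    KB_19790502_news.wav.md5
    KB_19790502_news_mezz.wav       (mezzanine wav)
    KB_19790502_news_mezz.wav.md5
    KB_19790502_news.mp3            (access file)
    KB_19790502_news.mp3.md5
    KB_19790502_news_01.jpg         (photo of the physical item)
    KB_19790502_news.xml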

md5deep can verify file hashes by comparing two reports created with the program, but I had to change the format of the checksum data before the reports could be compared.

Can md5deep run recursively through folders? Yes, and it can recursively compare everything in a directory (and subdirectories) against a manifest.
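
For reference, that plain manifest workflow looks something like this (the folder name is hypothetical; -r recurses, -l keeps paths relative, and -x does negative matching against a list of known hashes):

md5deep -r -l originals > manifest.md5

md5deep -r -l -x manifest.md5 originals

The second command prints any file whose hash does not appear in manifest.md5, so silence means everything matched.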

Can md5deep selectively run on just .wav files? Not that I know of, so I’ll ask some people.

Twitter & Slack inquiry: Hey, do you have a batch process that runs on designated files recursively?

Response: You’d have to employ some additional software or commands like [some unix example]

@Nkrabben: Windows or Unix? How about Checksumthing?

Me: Windows, and I can’t install new programs, including Python at the moment

@private_zero: hey! I’ve done something similar, but not on Windows. Try this PowerShell script that combines all the sidecar files into one text file. And by the way, remember to sort the lines in the file so they match the sort of the other file you’re comparing against.

Me: Awesome! With adjustments for my particular situation, it works like a charm. Can PowerShell scripts be given a clickable icon to run easily, like Windows batch files, in my work setup where I can’t install new things?

Answer: Don’t know… [Update: create a file with extension .ps1 and call that file from a .bat file]

@kieranjol: Hey! If you run this md5deep command it should run just on wav files.

Me: Hm, I tried it, but it doesn’t seem like md5deep is set up to run with that combination of Windows parameters.

@private_zero: I tried running a command; it seems md5deep works recursively but can’t pick out just the wav files. An additional filter is needed.

My afternoon of PowerShell and the command line: research on FC (file compare), sort, and ways to remove characters from text files (the vendor put an asterisk in front of every filename in their sidecar files, which had to be removed to match the output of an md5deep report).

??? moments:

It turns out PowerShell’s UTF8 output includes a byte order mark (BOM), unlike the ascii/“plain” UTF output of md5deep text files. This needed to be resolved before comparing files.

The md5deep output I created listed names only, not paths, but that left space characters at the end of lines! Those needed to be stripped out before comparing files.

I tried to perform the same functions of the PowerShell script in the Windows command line but kept hitting walls, so I went ahead with my solution of mixing PowerShell and command line commands.

After I got all six individual commands to run, I combined the PowerShell ones and the Windows command line ones. Here is my process for validating checksums:

Just the solution:

It’s messy, yes, and there are better and cleaner ways to do this! I recently learned about a shell scripting guide that advocates versioning, code reviews, continuous integration, static code analysis, and testing of shell scripts: https://dev.to/thiht/shell-scripts-matter

Create one big list of md5 hashes from the vendor’s individual sidecar files using PowerShell
–only include the preservation wav md5 sidecar files, look for them recursively through the directory structure, then sort them alphabetically. The combined file is named mediapreserve_20170302.txt. Remove the asterisk (vendor formatting) so that the text file matches the format of an md5deep output file; after the asterisks are removed, the vendor md5 hash values live in vendormd5edited.txt.

open PowerShell

nav to the new temp folder with the vendor files

# 1) concatenate the preservation wav md5 sidecars into one ASCII text file
dir .\* -exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav,*_mezz.wav,*.jpg,*.txt,*.bat -rec | gc | out-file -Encoding ASCII .\vendormd5.txt

# 2) build an alphabetically sorted list of the preservation wav files
Get-ChildItem -Recurse A:\mediapreserve_20170302 -Exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav.md5,*_mezz.wav,*.jpg,*.bat,*.txt | where { !$_.PSisContainer } | Sort-Object name | Select FullName | ft -hidetableheaders | Out-File -Encoding "UTF8" A:\mediapreserve_20170302\mediapreserve_20170302.txt

# 3) strip the vendor's leading asterisks so the lines match md5deep's format
(Get-Content A:\mediapreserve_20170302\vendormd5.txt) | ForEach-Object { $_ -replace '\*' } | set-content -encoding ascii A:\mediapreserve_20170302\vendormd5edited.txt

Create my md5 hashes to compare to the vendor’s
–run md5deep on the txt list of wav files from inside the temp folder, using the Windows command line (hashing many wav files will take a long time)

“A:\md5deep-4.3\md5deep.exe” -ebf mediapreserve_20170302.txt >> md5.txt

Within my new list of md5 values, sort the hashes alphabetically and trim the trailing space characters to match the format of the vendor checksum file. Then compare my text file of hashes against the file of vendor hashes.
–I put in pauses to make sure the previous commands completed, and so I could follow the order of commands.

run combined-commands.bat batch file (which includes):

:: /+34 sorts each line starting at column 34, i.e. by the filename rather than the hash
sort md5.txt /+34 /o md5sorteddata.txt

timeout /t 2 /nobreak

@echo off > md5sorteddata_1.txt & setLocal enableDELAYedeXpansioN
:: number each line with find /N, then trim trailing spaces from every line
for /f "tokens=1* delims=]" %%a in ('find /N /V "" ^<md5sorteddata.txt') do (
    SET "str=%%b"
    for /l %%i in (1,1,100) do if "!str:~-1!"==" " set "str=!str:~0,-1!"
    >>md5sorteddata_1.txt echo(!str!
)

timeout /t 5 /nobreak

:: fc /c compares the two files, ignoring case
fc /c A:\mediapreserve_20170302\vendormd5edited.txt A:\mediapreserve_20170302\md5sorteddata_1.txt
pause

The two files are the same, so every line matches, and therefore all checksums match. We’ve verified the integrity and authenticity of the files that transferred from the vendor to our server.

Current digital program production practices at KBOO

Things are always in flux at KBOO, often in order to improve the station. Digital program production practices change with newer software, hardware, or workflows to meet constantly evolving standards and to support KBOO’s radio programming into the future. For that reason, this brief diagram of the current digital production flow for live programs (i.e. how radio programming moves around, gets broadcast, and gets archived) reflects the flow as of December 2016, with known changes coming down the road.

[Diagram: KBOO digital program production flow for live programs, December 2016]

What is metadata?

This is something I wrote for KBOO Community Radio volunteer hosts and programmers, as part of the AAPB NDSR program, to build an understanding of how metadata is used in their digital preservation and archiving work.

Erin mentioned that the KBOO community has questions as to what metadata is. Simply put,

Metadata is a set of data that describes and gives information about other data.

Metadata is information. That cup on the table? It’s red, it’s ceramic, it belongs to Alex, it was bought last week but was made long before that; when, we aren’t sure. These are all things that provide information about the cup. People who enter and manage metadata document the most important pieces of information about an item, depending on who they expect will be looking for or learning about the item. At KBOO, as at other libraries and archives, metadata is entered and structured in a specific way to be both human and machine readable.

A human readable piece of metadata is a notes field that combines all the descriptive information we just discovered about that cup: “This cup is red, made of ceramic, it belongs to Alex, it was bought last week but was made long before that; when, we aren’t sure.”

A machine can’t understand the contents of that notes field. So how does metadata become machine readable? The first step is to follow a metadata schema. The field of information science has developed different metadata schemas, each suited to certain kinds of data, and schemas give metadata fields their definitions and meanings. At KBOO, the PBCore metadata schema is very useful: it was developed specifically for organizing and managing public broadcasting data, and it makes sense for time-based media like audio and video. People managing information about print books don’t need metadata fields for duration or generations; people managing audio metadata do want to document the duration of the content and whether it is the edited version or the broadcast version with promos. These fields are included in PBCore.

[Image by Steven J. Miller; I added PBCore to the mix. His web page on metadata resources is also excellent: https://people.uwm.edu/mll/metadata-resources/]

PBCore is a national metadata standard. Its specific definitions and rules ensure that content entered the same way by different institutions can be shared. PBCore data held in an XML document can be exchanged across a number of fields and disciplines, and it is both human and machine readable. In the absence of XML, many organizations use spreadsheets that can be transformed or edited to work with various systems.

It is KBOO’s intent, at the end of the NDSR program, to upload a first batch of audio content and metadata to the American Archive of Public Broadcasting (AAPB). The AAPB’s metadata management system is a complex hierarchical database, and data must be formatted in a specific way. When the fields of KBOO’s audio metadata are formatted using the PBCore schema, multiple records can be uploaded into the AAPB’s system in a csv file, taking advantage of the machine-readable definitions in the schema. If the metadata were not machine readable, a person would have to enter each metadata value for every field of every record manually. Using standards and metadata schemas lets an archivist make computers do the heavy lifting.
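
As a toy illustration (in Ruby, with invented titles and a pared-down pair of columns rather than the AAPB’s actual template), a shared schema is what lets one small script write out many records at once:

require 'csv'

# Two made-up records that share PBCore-named fields.
records = [
  { 'pbcoreTitle' => 'Evening News, 1979-05-02', 'pbcoreAssetDate' => '1979-05-02' },
  { 'pbcoreTitle' => 'Jazz Interview',           'pbcoreAssetDate' => '1985-11-20' }
]

headers = ['pbcoreTitle', 'pbcoreAssetDate']
CSV.open('aapb_upload.csv', 'w') do |csv|
  csv << headers
  # Because every record uses the same field names, one loop handles them all.
  records.each { |r| csv << headers.map { |h| r[h] } }
end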

Here are some examples of PBCore metadata fields:

pbcoreTitle is a name or label relevant to the asset.

Best practices: An asset may have many types of titles, such as a series title, episode title, segment title, or project title; therefore the element is repeatable. Usage: required, repeatable.

Sensible and understandable, right? Here’s another:

essenceTrackSamplingRate measures how often data is sampled when the audio portion of an instantiation is digitized. For a digital audio signal, the sampling rate is measured in kilohertz and is an indicator of the perceived playback quality of the media item (the higher the sampling rate, the greater the fidelity). Usage: optional, not repeatable.

If you are a KBOO program host, you have encountered sample rates without knowing it. The autoarchive mp3 file that magically shows up on your episode page is encoded at a sample rate of 44.1 kHz. Audio archives keep track of the quality and type of the digital files they collect; our guidelines for digitizing from open reel call for a 96 kHz sample rate at a bit depth of 24.

Right now I’m keeping 43 fields of information for each audio item. A handful of these fields will become defunct once the data in them is reviewed and unique metadata is moved into more appropriate fields. Erin and I decided to require a minimum of six fields for physical items: unique identifier, title, date, date type, format, and rights statement. With this minimum, KBOO knows what an item is and what it can do with it. All metadata fields are important, but requiring all of them would slow down cataloging with unknown information and long research periods; examples of non-required fields are publisher, subject, and contributor names. Thirteen more fields relate to the digital object, once it is created, and the AAPB’s system requires 13 metadata fields to be filled.

There is so much more to discuss about metadata. If you have any questions, you can email them to me at selena@kboo.org or tweet them to @selena_sjsu.

Ruby & Nokogiri for Webscraping

As part of the AAPB NDSR fellowship, we residents are afforded personal professional development time to identify and learn or practice skills that will help us in our careers. This is a benefit that I truly appreciate!

[Image: Ruby… the jewels theme continues from Perl]

I had heard of programming and scripting languages like Perl, Python, Ruby, and bash, but I didn’t really know examples that showed their power or applicability. In the past month, I’ve learned that knowing the basics of programming will be essential in data curation, digital humanities, and all-around data transformation and management. Librarians and archivists know how vast our data is: why not use programming to automate its handling? Data transformation is useful for mapping and preparing data to move between different systems, or just for getting data into a clearer, easier-to-read format than the one you are presented with.

At the Collections as Data Symposium, Harriett Green presented an update on the HTRC Digging Deeper, Reaching Further project [presentation slides]. This project provides tools for librarians through a curriculum that includes programming in Python. This past spring, the workshops were piloted at Indiana University and the University of Illinois–my alma mater :0). I can’t wait until the curriculum becomes more widely available so more librarians and archivists can learn the power of programming!


But even before the symposium, my NDSR cohort had been chatting about the amounts of data we need to wrangle, manage, and clean up at our different host sites. Lorena and Ashley had a Twitter conversation about Ruby that I caught wind of. Because of my current project at KBOO, I was interested in webscraping to collect data presented on HTML pages. Webscraping can be done in both Python and Ruby. My somewhat arbitrary decision to learn Ruby over Python is probably based on the gentle, satisfying sound of the name. I was told that Python is the necessary language for natural language processing, but since my needs centered on parsing HTML, plus a separate interest in learning how Ruby on Rails works, I went with Ruby. Webscraping requires an understanding of the command line and HTML.

  • I installed Ruby on my Mac with Homebrew.
  • I installed Nokogiri.
  • I followed a few tutorials before I realized I had to read more about the fundamentals of the language.

I started with this online tutorial, but there are many other readings for getting started in Ruby. Learning the fundamentals of the language involved hands-on programming and following basic examples. After learning the basics, it was time for me to start thinking in the logic of Ruby to compose my script.

[Image from http://ruby.bastardsbook.com/chapters/collections/]
Pieces of data are considered objects in Ruby. These objects can be pushed into an array, or set of data.

As a newbie Ruby programmer, I learned that there is a lot I don’t know, and that there are better and more sophisticated ways to program as I learn more, but I can get results now while learning along the way. For example, another way to manipulate a data set in Ruby is to create a hash of values; I decided to keep going with the array in my example.
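
To illustrate the difference with a made-up program entry:

# The same data as an array, where values are identified by position...
program = ['KBOO News', 'Monday 5pm']
# ...and as a hash, where each value is labeled with a key
program = { name: 'KBOO News', airtime: 'Monday 5pm' }
program[:name]  # => "KBOO News"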

[Screenshot: http://kboo.fm/program lists 151 current on-air programs. I want to see a compact list, in addition to the specific program information on the pages seen after clicking each program name.]

So, what did I want to do? There is a set of program data spread across multiple html pages that I would like to see in one spreadsheet. The abstract of my Ruby script, in colloquial terms, goes something like this:

  1.  Go to http://kboo.fm/program and collect all the links to radio programs.
  2. Open each radio program html page from that collection and pull out the program name, program status, short description, summary, hosts, and topics.
  3. Export each radio program’s data as a spreadsheet row next to its url, with each piece of information in a separate column with a column header.
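
Here is a minimal sketch of that logic with Nokogiri. The CSS selectors are illustrative stand-ins (the real ones come from inspecting the pages), and it trims the fields down to a name and short description:

require 'open-uri'
require 'nokogiri'
require 'csv'

# Step 1: collect the links to individual program pages from the index.
index = Nokogiri::HTML(URI.open('http://kboo.fm/program'))
links = index.css('a').map { |a| a['href'] }
             .select { |href| href.to_s.start_with?('/program/') }
             .uniq

# Steps 2 and 3: open each program page, pull out fields, write a csv row.
CSV.open('programs.csv', 'w') do |csv|
  csv << ['url', 'name', 'short_description']  # column headers
  links.each do |path|
    page = Nokogiri::HTML(URI.open("http://kboo.fm#{path}"))
    name = page.at_css('h1')&.text&.strip
    desc = page.at_css('.short-description')&.text&.strip  # guessed selector
    csv << ["http://kboo.fm#{path}", name, desc]
    sleep 1  # space out requests to be polite to the server
  end
end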

My Ruby script and the resulting csv are on my brand new GitHub, check it out!

The script takes about a minute to run through all 151 rows, and I’m not sure if that’s an appropriate amount of time for it to take. I also read that when webscraping, one should space out the server requests or the server may blacklist you–there are ethics to webscraping. I also noticed that I could clean up the array-within-array data: the host names, topics, and genres still have surrounding brackets in the csv.
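
A small fix would be to join each inner array into one string before writing the row, something like:

# hosts might be ["Host A", "Host B"]; written straight into a csv cell it
# keeps its brackets. Joining first produces a clean cell:
hosts = ['Host A', 'Host B']
hosts.join('; ')  # => "Host A; Host B"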

It took me a while to learn each part of this, and I also used parts of other people’s scripts that were similar to my need. The process took a lot of trial and error, but it showed me that I could work with the logic and figure it out!

There is a lot of data on web pages, and webscraping basics in a programming language like Ruby can help retrieve items of interest and order them into tidy csv files, or transform and manipulate them in various ways. Next, I think I can create a similar script that lists all of a program’s episodes with pertinent metadata that can be mapped to a PBCore data model, i.e. episode title, air date, episode description, and audio file location.

Please share any comments you have!

In a recent Flatiron School webinar on Ruby and JavaScript, Avi Flombaum recommended these books on Ruby: The Well-Grounded Rubyist, Refactoring: Ruby Edition, and Practical Object-Oriented Design in Ruby. Learn Ruby the Hard Way comes up a lot in searches as well.