Ruby & Nokogiri for Webscraping

As part of the AAPB NDSR fellowship, we residents are afforded personal professional development time to identify and learn or practice skills that will help us in our careers. This is a benefit that I truly appreciate!


Ruby… the jewels theme continues from Perl

I had heard of programming and scripting languages like Perl, Python, Ruby, and bash, but I really didn’t know of examples that showed their power or applicability. In the past month, I’ve learned that knowing the basics of programming will be essential in data curation, digital humanities, and all-around data transformation and management. Librarians and archivists are aware of the vast amount of data out there, so why not use programming to automate data handling? Data transformation is useful for mapping and preparing data between different systems, or just for getting data into a clearer, easier-to-read format than the one you are presented with.

At the Collections as Data Symposium, Harriett Green presented an update on the HTRC Digging Deeper, Reaching Further Project [presentation slides]. This project provides tools for librarians by creating a curriculum that includes programming in Python. This past spring, the workshops were piloted at Indiana University and the University of Illinois, my alma mater :0). I can’t wait until the curriculum becomes more widely available so more librarians and archivists know the power of programming!

But even before the symposium, my NDSR cohort had been chatting about the amounts of data we need to wrangle, manage, and clean up at our different host sites. Lorena and Ashley had a Twitter conversation on Ruby that I caught wind of. Because of my current project at KBOO, I was interested in webscraping to collect data presented on HTML pages. Webscraping can be done with both Python and Ruby. My arbitrary decision to learn Ruby over Python is probably based on the gentle, satisfying sound of the name. I was told that Python is the necessary language for natural language processing. But since my needs were focused on parsing HTML, and I had a separate interest in learning how Ruby on Rails functions, I went with Ruby. Webscraping requires an understanding of the command line and HTML.

  • I installed Ruby on my Mac with Homebrew.
  • I installed Nokogiri.
  • I followed a few tutorials before I realized I had to read more about the fundamentals of the language.

I started with this online tutorial, but there are many other readings to get started in Ruby. Learning the fundamentals of the Ruby language included hands-on programming and following basic examples. After learning the basics of the language, it was time for me to start thinking in the logic of Ruby to compose my script.

Pieces of data are considered objects in Ruby. These objects can be pushed into an array, or set of data.
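As a minimal sketch of that idea (the program names are made up for illustration and are not from my actual data), pushing objects into an array looks something like this:

```ruby
# Start with an empty array, then push string objects into it.
program_names = []
program_names.push("Morning Show") # hypothetical program name
program_names << "Jazz Hour"       # << is shorthand for push

# Every piece of data is an object, so it can tell you its class.
first_class = program_names.first.class # String
count = program_names.length            # number of items collected so far
```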

As a newbie Ruby programmer, I learned that there is a lot I don’t know, there are better and more sophisticated ways to program if I know more, but I can get results now while learning along the way. For example, another way data sets can be manipulated in Ruby is by creating a hash of values. I decided to keep going with the array in my example.
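For comparison, here is a rough sketch of the hash approach I did not take. The keys and values are hypothetical, chosen only to show the shape of the data:

```ruby
# A hash labels each value with a key, instead of relying on position.
program = {
  name: "Morning Show",   # hypothetical values for illustration
  status: "Current",
  hosts: ["Ana", "Ben"]
}

name = program[:name]       # look a value up by its key
as_row = program.values     # the same data as a plain array, in insertion order
field_names = program.keys  # could serve as column headers
```

With an array, I have to remember that the name is the first item; with a hash, I can ask for it by name.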

KBOO’s program page lists 151 current on-air programs. I want to see a compact list, in addition to the specific program information on the pages that appear after clicking a program name.

So, what did I want to do? There is a set of program data spread across multiple HTML pages that I would like to look at in one spreadsheet. In colloquial terms, my Ruby script does something like this:

  1. Go to the program listing page and collect all the links to radio programs.
  2. Open each radio program HTML page from that collection and pull out the program name, program status, short description, summary, hosts, and topics.
  3. Export each radio program’s data as a spreadsheet row next to its URL, with each piece of information in a separate column with a column header.

My Ruby script and the resulting CSV are on my brand new GitHub. Check it out!

The script takes about a minute to run through all 151 rows, and I’m not sure whether that’s an appropriate amount of time for it to take. I also read that when webscraping, one should space out the server requests or the server may blacklist you; there are ethics to webscraping. I also noticed that I could clean up the array-within-array data: the host names, topics, and genres still have brackets surrounding them in the output.
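Both of those cleanups could look something like the sketch below. The method names and the one-second default pause are my own assumptions for illustration, not something taken from my script or from any server’s stated policy:

```ruby
# Turn a (possibly nested) array into one readable spreadsheet cell,
# so the cell reads as plain text instead of a bracketed array.
def flatten_cell(values, separator = "; ")
  Array(values).flatten.map { |v| v.to_s.strip }.reject(&:empty?).join(separator)
end

# Visit each URL in turn, pausing between requests to go easy on the server.
# The block is where the actual page fetch would happen.
def each_politely(urls, delay_seconds = 1.0)
  urls.each_with_index do |url, index|
    sleep(delay_seconds) if index.positive? # no pause before the first request
    yield url
  end
end
```

A longer pause makes the whole run slower, so a script that takes a minute for 151 pages may simply be one that is not pausing at all.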

It took me a while to learn each part of this, and I also borrowed parts of other people’s scripts that were similar to what I needed. It showed me that programming takes a lot of trial and error, but also that I could work with the logic and figure it out!

There is a lot of data on web pages, and webscraping basics with a programming language like Ruby can help retrieve items of interest and order them into nice csv files, or transform and manipulate them in various ways. Next, I think I can create a similar script that lists all of a program’s episodes with pertinent metadata that can be mapped to a PBCore data model, i.e. episode title, air date, episode description, and audio file location.

Please share any comments you have!

In a recent Flatiron School webinar on Ruby and JavaScript, Avi Flombaum recommended these books on Ruby: The Well-Grounded Rubyist; Refactoring: Ruby Edition; and Practical Object-Oriented Design in Ruby. Learn Ruby the Hard Way comes up a lot in searches as well.


NYPL’s new digital collections with annotation and juxtaposition tools

Earlier this semester, Karoline posted about NYPL’s Digital Gallery. However, last summer the NYPL Labs team launched their new Digital Collections, now in beta, to draw together all their digital materials, including video. In addition, there are video composition tools that allow users to search for content based on its copyright and usage details, and to place items side by side, annotate, save, and share them. I actually got to play around with the tool as it was being built and content was being digitized last summer in a fellowship.


The new interface is more interactive, and communicates more about the items available than the previous digital gallery. Less clicking, more faceted browsing, and streaming video will hopefully bring more traffic and attention to the new platform. The video composition tool allows users to search not only NYPL content, but also YouTube content. The idea behind this is that many people are creators of digital content, and the tools support their use of existing content. I actually used the tool to show a middle school class that one of Beyoncé’s music videos was “inspired by,” or “lifted” moves straight from, Bob Fosse’s 1969 performance on The Ed Sullivan Show. When more archival content is digitized and accessible, users can learn and discover facts that would otherwise have been less findable.

Because I wasn’t familiar with other simple mashup tools, and I knew about the NYPL Video Tools, I went ahead and used their tool. Sometimes the interface and playback are buggy, so I recorded the playback with Screencast-o-matic, but all in all the tools are great and allow researchers or teachers to use digital content in the classroom.

Memento for Time Travel on the Web

Information on the web has introduced new wants and needs for end-users, researchers, and digital curators alike. The most common web protocol, HTTP, has no built-in way to represent time, so archived versions of web content typically sit at URIs that are disconnected from the original. Our current web archives are siloed by domain or URI; web navigation across sites at a specific point in time has not yet been developed.

Los Alamos National Lab (LANL) and Old Dominion University are experimenting with providing seamless access to archival content without disrupting the web navigation experience. In 2009, the team introduced a solution to the challenge of finding and navigating to existing historical Web information. Some web servers have archival capabilities, that is, they store versions of content over time. Many other servers have no local archival capabilities, so they only host the current version of web content. Using Transparent Content Negotiation for HTTP (RFC 2295) and developing an API for archives of web resources, they released a Chrome extension that lets a user “time travel”: at each link on a webpage, a user can choose to browse in the present time or a user-chosen date in the past.
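To illustrate the idea of asking for a page as it existed on a past date, here is a rough Ruby sketch. It assumes the Accept-Datetime request header that Memento uses for datetime negotiation, and a TimeGate address at timetravel.mementoweb.org; neither detail appears in this post, so treat both as assumptions. The request is only built here, not sent.

```ruby
require "net/http"
require "time"
require "uri"

# Assumed TimeGate base address; a TimeGate is the service that picks
# the archived version closest to the requested date.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"

# Build (but do not send) a GET request for original_url as of datetime.
def memento_request(original_url, datetime)
  uri = URI.parse(TIMEGATE_BASE + original_url)
  request = Net::HTTP::Get.new(uri)
  # Time#httpdate formats the time as an HTTP date in GMT.
  request["Accept-Datetime"] = datetime.httpdate
  request
end
```

The point is that the URL being asked for stays the familiar one; only the extra header says “but as of this date, please.”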

Institutions that are still grappling with archiving their web content will need to complete that work before they can take advantage of Memento, since the tool builds on existing archived content and web navigation behavior. However, for those institutions whose archives are set up, I’d recommend checking out Memento!

The emerging digital stack in 2014–Digital preservation network update

Just in the past couple years, the idea of the Digital Preservation Network (DPN) has gained significant ground, bringing together important thinkers with plenty of storage capacity and technical know-how. The DPN leadership, in a 2013 Educause article, touch on important concepts at the heart of digital curation: “At the heart of DPN is a commitment to replicate the data and metadata of research and scholarship across diverse software architectures, organizational structures, geographic regions, and political environments. Replication diversity, combined with succession rights management and vigorous audit and migration strategies, will ensure that future generations have access to today’s discoveries and insights.”1 The system components include an identifier scheme for asserting and tracking the identity of content in the network, an encryption framework, an audit process for ensuring fixity of packages, a reporting infrastructure, individual, distributed repositories and a messaging system between them, and a distributed registry for recording object location. The node-to-node replication transfer uses the BagIt protocol.2
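To make “ensuring fixity” a little more concrete, here is a generic sketch of a checksum audit in Ruby. It is not DPN’s actual audit process, which the sources above do not spell out. It assumes a BagIt-style layout in which a manifest-sha256.txt file lists one checksum and one file path per line, and the method name is my own invention.

```ruby
require "digest"

# Return the paths of files whose current SHA-256 checksum no longer
# matches the manifest, or that have gone missing entirely.
def failed_fixity(bag_dir)
  manifest_path = File.join(bag_dir, "manifest-sha256.txt")
  File.readlines(manifest_path, chomp: true).reject(&:empty?).filter_map do |line|
    # Each manifest line is assumed to be "<checksum> <relative path>".
    expected, relative_path = line.split(/\s+/, 2)
    full_path = File.join(bag_dir, relative_path)
    intact = File.file?(full_path) &&
             Digest::SHA256.file(full_path).hexdigest == expected
    relative_path unless intact
  end
end
```

An empty result means every listed file still matches the checksum recorded when the package was made; anything else is a file to investigate.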

I admit, I really want to share a colorful model and the snappy phrase “emerging digital stack” because it provides yet another view of the components involved in digital preservation, and thus digital curation. Before I share the model, here are some brief facts about the development of the DPN so far.

– The Digital Preservation Network (DPN) architecture plan is drafted.
– The vision: A federated preservation network owned by and for the academy. The focus is on a dark archive for the academic record. DPN is an ecosystem, not a software project. It is designed to evolve with the expected changes in new forms of scholarship, file formats, software, and technology platforms.

Fall 2013
– DPN still in “start-up” phase, goal to connect Academic Preservation Trust (APTrust), Chronopolis, HathiTrust, Stanford Digital Repository (SDR), and University of Texas Digital Repository (UTDR).1
– Started a task force on Audio, Video and Film 3
– DPN Inaugural Board Announced

– Inaugural Charter Member Meeting planned for April.
– More than 50 institutional members
– “DPN is actively progressing three important projects that will create a working proof of concept of the core components. The DPN node sub-group is defining the technical architecture for the replicating nodes. The governance sub-group is defining a governance and sustainability model. The data partnerships sub-group is completing an environmental scan of research data preservation.”4

The Emerging Digital Stack

The concepts illustrated by the emerging digital stack help bring “greater coherence and interoperability to the digital preservation space.”1 The ecosystem connects many open, flexible standards, and allows data owners to keep control of their data. The repositories chosen have focused on long-term preservation. Through the service components provided, the DPN works towards the development of metadata standards and their preservation, and towards the trustworthiness and integrity of data through its auditing systems.

1 The Case for Building a Digital Preservation Network.

2 Digital Preservation Network Wiki

4 Digital Preservation Network Website

How social media sites are (and aren’t) like digital curation

In my Digital Asset Management course, I made a case for Facebook as a DAM. I still see social media as a powerful and very useful internet-enabled filter to the world. Before we were enamored with social media darlings like Pinterest, Twitter, Tumblr, Instagram, and Facebook, there seemed to be a lot of good talk about push/pull technology. Back then, RSS feeds were innovative. Push/pull still exists, it just has cooler exteriors. Yes, I let Facebook (and Jon Stewart) curate my news. I follow pages (news feeds) of organizations I care about, and of course I care about the news my friends care about as well, the assumption being we have shared interests. The news feed selection process is not manual, it is built in, and so I have a readily curated list of socially relevant topics to read.

Social media sites are also definitely not like digital curation! There is no authoritative metadata, source information, or guarantee of preserved data. The technology is proprietary, so I am only a user who can’t question the longevity of information. Social media sites serve up links and content that may very well disappear the next day. Sharing is not always caring: understandably, I can be averse to many items in my news feed. If I want information I can rely on, social media is not the place to go for it, considering I need to see a citation and check it twice.

UPDATE 4/28/2014: I wanted to update this with new thoughts regarding authoritative digital curation sites and social media. I believe that if people want to reach the audiences intended by each social media platform, they will have to cater to them by rewriting content and curating it for those specific platforms. How authoritative the metadata is may not be an issue, if the objects are meant to attract attention and link back to an authoritative source.