How to Easily Archive Vulnerable Environmental Data
The Internet Archive, through its Wayback Machine, has been capturing snapshots of the World Wide Web for over 20 years. But a new effort to save a specific type of Internet content, scientific environmental data, has emerged among some scientists and professors. Groups like the Penn Program in Environmental Humanities and its offshoot Data Refuge are committed to preserving “…the facts we need at a time of ongoing climate change.”
These groups encourage others to download and store public scientific data. They’ve acted as a catalyst for groups around the country to host data meetups. At Nexla, we are committed to building tools that make it easier to collaborate with data. In that spirit, we’d like to share two of the methods we’ve found for archiving important data sets.
PSA: There are many efforts to archive internet content in the public domain. Before making your own copy of a website, you might want to visit some of these to see if the content is already archived.
Photo by Beth Scupham
Method One: Wget
There are many tools available for archiving data and web sites accessible via common protocols like HTTP and FTP. Some tools, such as curl and Wget, come pre-installed on many modern operating systems.
curl and Wget are both well-supported command-line tools for fetching remote files. curl additionally supports programmatic downloads via its libcurl library. curl fetches only the URLs you explicitly give it, whereas Wget can recursively follow the links referenced within a web page. Both tools are sophisticated and handy. For an in-depth comparison, we recommend reading the notes from the author of curl himself, Daniel Stenberg.
At Nexla, we ran several experiments using Wget to archive complex websites. Here’s a sample command line with the options we found most useful:
wget -mk --random-wait -x http://climate.nasa.gov
-mk tells Wget to mirror the site and to convert links in the site so that they can be viewed in your local copy.
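--random-wait causes Wget to vary the pause between accesses at random, which is useful for not over-taxing the server. It works by scaling the delay set with --wait, so add an option such as --wait=1 to get an actual pause.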
-x tells Wget to recreate, as best it can, the directory hierarchy of the site.
We found that Wget had trouble recognizing some file types, such as .css, .js, and .mp4, and saving them in the local copy. Your mileage may vary with clever tweaking of Wget options.
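As one example of such tweaking, here is a minimal sketch of a shell function that wraps the command above with a few extra standard Wget options. The function name mirror_site, the PRINT_ONLY switch, and the default two-second delay are our own assumptions for illustration. We have not verified that these options recover every missing file on climate.nasa.gov; --page-requisites asks Wget to fetch the assets a page needs for display, and --wait sets the base delay that --random-wait then varies.

```shell
# mirror_site is a hypothetical helper name; the flags are standard GNU Wget options.
# Usage: mirror_site <url> [base-wait-seconds]
mirror_site() {
    url="$1"
    wait_secs="${2:-2}"   # assumed base delay in seconds; tune it for the site

    # --mirror            long form of -m: recursive download with timestamping
    # --convert-links     long form of -k: rewrite links for local viewing
    # --page-requisites   also fetch images, stylesheets and similar page assets
    # --adjust-extension  save HTML and CSS files with matching extensions
    # --no-parent         never climb above the starting directory
    # --wait              base pause between requests, varied by --random-wait
    set -- wget --mirror --convert-links --page-requisites \
        --adjust-extension --no-parent \
        --wait="$wait_secs" --random-wait -x "$url"

    if [ "${PRINT_ONLY:-0}" = "1" ]; then
        echo "$*"         # show the command without touching the network
    else
        "$@"              # run Wget with the arguments built above
    fi
}
```

Setting PRINT_ONLY=1 before calling mirror_site http://climate.nasa.gov prints the full command so you can review it first; without it, the function starts the mirror.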
Method Two: HTTrack
HTTrack is a popular free tool designed for cloning web sites. You can find it on the HTTrack website. You may find HTTrack to be a better tool for hands-free cloning of complex sites with lots of links and embedded media types.
HTTrack is smart about following links in web pages it’s downloading, handling various media types, and varying the wait times between accesses to avoid overloading the server. HTTrack can resume an interrupted download with the -i option. We found this quite handy in cloning climate.nasa.gov. Here’s the command we issued:
httrack -i climate.nasa.gov
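If you run the clone more than once, a small wrapper can decide when to resume. The sketch below is an assumption-laden illustration, not the exact procedure we used: clone_site and PRINT_ONLY are hypothetical names, and the default ./mirror directory is arbitrary. It relies on HTTrack's -O option to choose the project directory and on the hts-cache folder HTTrack keeps there, and adds -i only when that cache already exists.

```shell
# clone_site is a hypothetical helper name; -O and -i are HTTrack options.
# Usage: clone_site <url> [project-dir]
clone_site() {
    url="$1"
    dest="${2:-./mirror}"   # assumed default project directory

    # -O tells HTTrack where to write the mirror and its cache.
    set -- httrack "$url" -O "$dest"

    # HTTrack stores its state in hts-cache inside the project directory.
    # If it is present, add -i to continue the earlier run.
    if [ -d "$dest/hts-cache" ]; then
        set -- "$@" -i
    fi

    if [ "${PRINT_ONLY:-0}" = "1" ]; then
        echo "$*"           # show the command without running HTTrack
    else
        "$@"
    fi
}
```

Calling clone_site climate.nasa.gov ./nasa-mirror a second time after an interruption picks up the -i flag automatically, because the first run leaves its cache behind.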
Once you’ve downloaded a copy of the data or web site, you may want to make it available to others or archive it in the cloud. For Nexla’s mirror of climate.nasa.gov, we’re using an Amazon S3 bucket.
S3cmd is a handy command-line tool for working with S3. You can use it to add and remove files, list what you’ve got, and keep local directories in sync with S3 buckets. We used one command to upload our entire copy of climate.nasa.gov to S3:
s3cmd sync ./climate.nasa.gov s3://archive.nexla.com
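Before uploading a large mirror, it can help to preview what will be transferred. The following is a minimal sketch that wraps the sync command with s3cmd's --dry-run option; the helper name upload_archive, its preview and upload modes, and the PRINT_ONLY switch are our own illustrative assumptions rather than part of s3cmd.

```shell
# upload_archive is a hypothetical helper name, not part of s3cmd.
# Usage: upload_archive preview|upload <local-dir> <s3-uri>
upload_archive() {
    mode="$1"
    src="$2"
    dest="$3"

    case "$mode" in
        # --dry-run lists what would be transferred without sending anything
        preview) set -- s3cmd sync --dry-run "$src" "$dest" ;;
        # without --dry-run, sync uploads new and changed files
        upload)  set -- s3cmd sync "$src" "$dest" ;;
        *)
            echo "usage: upload_archive preview|upload DIR S3URI" >&2
            return 2
            ;;
    esac

    if [ "${PRINT_ONLY:-0}" = "1" ]; then
        echo "$*"   # show the command without contacting S3
    else
        "$@"
    fi
}
```

A typical session would run upload_archive preview ./climate.nasa.gov s3://archive.nexla.com, check the listing, and then repeat the call with upload.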
Note that you’ll need your AWS account access keys to use s3cmd. Enter:
s3cmd --configure
on the command line and it will guide you through the setup. Of course, there are lots of cloud storage options available, but we focused on tools for S3. You might want to explore Google Cloud Storage, Amazon S3 Glacier, or Backblaze B2 and compare them based on price, access methods, and reliability.
Be a Polite, Law-Abiding Data Archiver
Nexla recommends observing all applicable copyright laws and using common courtesy when archiving data and web sites. Tools like Wget and HTTrack should not be used to copy data to which you do not have legal access. Just because a web site is publicly accessible does not mean it can be legally copied or mirrored.
Be considerate to the webmasters at the sites you wish to copy: don’t overload their servers. In fact, the HTTrack folks have some handy guidelines for good web archiving behavior. Whether you plan to join the efforts of Data Refuge or would like to download public data for your own research, we hope you find these tools useful.