
URL-Archiver

URL-Archiver extracts URLs from any Unicode text or PDF file and lets you archive them interactively with one of the supported archiving services.

⚠️ The application was designed to be platform-independent. However, it has been tested only on the following systems, so full functionality on other platforms cannot be guaranteed.

  • Windows 11 (Version 23H2)
  • Windows 10 (Version 22H2)
  • macOS (Ventura)
  • Ubuntu (20.04.3 LTS)

Authors

Supervisor

Installation

Requirements

To build and start the application, ensure that the following dependencies are installed on your system:

  • Git: Latest stable version recommended.
  • Maven: Version 3.8 or higher.
  • Java: Version 21.
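A quick way to confirm the prerequisites are installed is to check that each tool is on your PATH. The executable names below are the standard ones; adjust them if your installation differs:

```shell
# Check that the required build tools (Git, Maven, Java) are on PATH
# and print the first line of each tool's version output when found.
status=""
for tool in git mvn java; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($("$tool" --version 2>&1 | head -n 1))"
  else
    echo "$tool: MISSING - install it before building"
  fi
  status="$status $tool"
done
```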

Clone the repository

To clone the repository, run the following command in a terminal:

git clone https://github.com/devobern/URL-Archiver.git

Build and run scripts

Build and run scripts are provided for Windows (build.ps1, run.ps1, build_and_run.ps1) and for Linux and macOS (build.sh, run.sh, build_and_run.sh). The scripts are located in the root directory of the project.

⚠️ The scripts need to be executable. To make them executable, run the following command in a terminal:

  • Linux / macOS: chmod +x build.sh run.sh build_and_run.sh
  • Windows:
    • Open PowerShell as an Administrator.
    • Check the current execution policy by running: Get-ExecutionPolicy.
    • If the policy is Restricted, change it to RemoteSigned to allow local scripts to run. Execute: Set-ExecutionPolicy RemoteSigned.
    • Confirm the change when prompted.
    • This change allows you to run PowerShell scripts that are written on your local machine. Be sure to only run scripts from trusted sources.
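On Linux and macOS, the chmod step can be sketched and verified as follows; demo.sh is a placeholder standing in for the actual scripts:

```shell
# Create a placeholder script, mark it executable, and verify the bit is set.
# In the repository, apply chmod to build.sh, run.sh and build_and_run.sh instead.
printf '#!/bin/sh\necho hello\n' > demo.sh
chmod +x demo.sh
[ -x demo.sh ] && echo "demo.sh is executable"
```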

Windows

Build the application

To build the application, open a PowerShell window and run the following script:

./build.ps1

Run the application

To run the application, open a PowerShell window and run the following script:

./run.ps1

Build and run the application

To build and run the application, open a PowerShell window and run the following script:

./build_and_run.ps1

Linux / macOS

Build the application

To build the application, open a terminal and run the following script:

./build.sh

Run the application

To run the application, open a terminal and run the following script:

./run.sh

Build and run the application

To build and run the application, open a terminal and run the following script:

./build_and_run.sh

User Manual

⚠️ To follow the instructions in this section, the application must be built first (see Installation).

The URL-Archiver is a user-friendly application designed for extracting and archiving URLs from text and PDF files. Its intuitive interface requires minimal user input and ensures efficient management of URLs.

Getting Started

Windows

Open PowerShell, navigate to the application's directory, and execute:

./run.ps1

Linux / macOS

Open Terminal, navigate to the application's directory, and run:

./run.sh

Operating Instructions

Upon launch, provide a path to a text or PDF file, or a directory containing such files. The application will process and display URLs sequentially.
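Conceptually, the extraction step resembles the following grep sketch; the application itself uses its own Java-based extractor, so this is illustrative only:

```shell
# Illustrative only: pull http(s) URLs out of a plain-text file.
printf 'See https://example.com and http://archive.org/web for details.\n' > sample.txt
grep -Eo 'https?://[^[:space:]]+' sample.txt > urls.txt
cat urls.txt
```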

Navigation

Use the following keys to navigate through the application:

  • o: Open the current URL in the default web browser.
  • a: Access the Archive Menu to archive the URL.
  • s: Show a list of previously archived URLs.
  • u: Update and view pending archive jobs.
  • n: Navigate to the next URL.
  • q: Quit the application.
  • c: Change application settings.
  • h: Access the Help Menu for assistance.

Archiving URLs

Choose between archiving to Wayback Machine, Archive.today, both, or canceling.

When you choose Archive.today, an automated browser session starts in which you must solve a CAPTCHA. Once solved, the URL is archived, and the link to the archived version is collected and stored by the application.
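For background, archiving to the Wayback Machine is commonly done through its Save Page Now endpoint using the S3-Credentials described under Getting S3-Credentials (Wayback Machine). The sketch below only prints the request it would send; ACCESS_KEY and SECRET_KEY are placeholders, and the exact request the application issues may differ:

```shell
# Print (do not send) a Save Page Now request to the Wayback Machine.
# ACCESS_KEY and SECRET_KEY are placeholders for your S3-Credentials.
ACCESS_KEY="your-access-key"
SECRET_KEY="your-secret-key"
TARGET_URL="https://example.com"
cmd="curl -s -X POST https://web.archive.org/save \
  -H 'Accept: application/json' \
  -H 'Authorization: LOW ${ACCESS_KEY}:${SECRET_KEY}' \
  -d 'url=${TARGET_URL}'"
echo "$cmd"
```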

Configuration

Customize Access/Secret Keys and the default browser. Current settings are shown with default values in brackets.

To get your S3-Credentials, follow the instructions in Getting S3-Credentials (Wayback Machine).

Exiting

  • To exit, press q. If a BibTeX file was provided, you'll be prompted to save the archived URLs to it.
  • Otherwise, or after saving to the BibTeX file, you'll be prompted to save the archived URLs to a CSV file.

For BibTeX entries:

  • Without an existing note field, URLs are added as: note = {Archived Versions: \url{url1}, \url{url2}}
  • With an existing note field, they're appended as: note = {, Archived Versions: \url{url1}, \url{url2}}
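For illustration, a hypothetical entry without an existing note field might look like this after archiving (the entry key and archived URLs are made up):

```
@article{example2023,
  title  = {An Example Article},
  author = {Doe, Jane},
  note   = {Archived Versions: \url{https://web.archive.org/web/20230101000000/https://example.com}, \url{https://archive.today/abcde}}
}
```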

Getting S3-Credentials (Wayback Machine)

To generate your S3-Credentials, you need a Wayback Machine (archive.org) account.

Generate S3-Credentials

  1. Log in to your Wayback Machine (archive.org) account.
  2. Open your account's S3 API keys page to generate your S3-Credentials. If needed, you can also delete your S3-Credentials on the same page.

Project Status and Future Contributions

Current Development Status

Development of URL-Archiver by the original team is currently on hold: work and academic commitments prevent us from dedicating the necessary time to the project in the near future.

Open for Contributions

We welcome and encourage the open-source community to contribute to the development and enhancement of the URL-Archiver.

If you are interested in contributing, please ensure that any contributions adhere to the project's existing license terms.

We look forward to seeing how the URL-Archiver grows and evolves with the community's support and contributions.

Future Work

While we are currently not in a position to actively pursue these enhancements, we believe the following improvements would significantly contribute to the project's evolution and utility:

  • Improving the URL extraction algorithm for more efficient and accurate results.
  • Expanding support for various input file types.
  • Implementing a user-friendly graphical interface.
  • Enabling multilingual support for global accessibility.
  • Automatically archiving all URLs in a file for efficiency.
  • Providing more detailed setting options for user customization.
  • Publishing the application in package repositories to simplify installation.
  • Improving the code layout, like breaking up the controller for better clarity.
  • Testing on other platforms (such as Fedora) to ensure platform independence.
  • Supporting command-line arguments such as ./run.sh --archive both --urlsource /tmp/my_url_list.txt or ./run.sh --archive today --url https://example.com
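A hypothetical sketch of how such flag parsing could look in a wrapper script; none of these options exist in URL-Archiver today:

```shell
# Hypothetical: parse the proposed --archive/--urlsource/--url flags.
set -- --archive both --urlsource /tmp/my_url_list.txt  # example invocation
ARCHIVE=""; URLSOURCE=""; URL=""
while [ $# -gt 0 ]; do
  case "$1" in
    --archive)   ARCHIVE="$2";   shift 2 ;;
    --urlsource) URLSOURCE="$2"; shift 2 ;;
    --url)       URL="$2";       shift 2 ;;
    *) echo "unknown option: $1" >&2; shift ;;
  esac
done
echo "archive=$ARCHIVE urlsource=$URLSOURCE url=$URL"
```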

Uninstallation

To uninstall the application, simply delete the folder containing it.

Licenses and Attributions

This project uses the following open-source software:

  • JUnit Jupiter API: Eclipse Public License v2.0 (EPL-2.0)
  • JUnit Jupiter Engine: Eclipse Public License v2.0 (EPL-2.0)
  • Selenium Java: Apache License 2.0
  • Selenium Logger: MIT License
  • Mockito Core: MIT License
  • Mockito JUnit Jupiter: MIT License
  • System Lambda: MIT License
  • Apache PDFBox: Apache License 2.0
  • Jackson Core: Apache License 2.0
  • Jackson Dataformat XML: Apache License 2.0
