Trafilatura - Installing and running from the terminal

Trafilatura ... includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to commonly used formats.

In this series, we’re going to explore a powerful Python library called Trafilatura. Each post will gradually build on the previous one, adding complexity step by step to deepen your understanding and take you from running Trafilatura in the terminal to deploying your own Trafilatura web server.

Requirements

Python

In this first post, we’ll simply install Trafilatura and run it in a few different ways from the terminal.

Open your terminal, navigate to your projects directory, and run:

mkdir trafi_api
cd trafi_api
touch main.py
git init

At this point, it’s a good idea to create and activate a Python virtual environment for this project, so that Trafilatura and its dependencies stay isolated from your system-wide packages. This step is optional but recommended.
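
If you do set up a virtual environment, a minimal sketch for macOS or Linux could look like the following. The directory name .venv is only a common convention, and the .gitignore line is an assumption on our part, added so that a later "git add ." does not commit the environment itself. On Windows, the activation command is different.

```shell
# Create a virtual environment in a hidden .venv directory
# (the name is a convention, not a requirement)
python3 -m venv .venv

# Activate it for the current shell session (macOS/Linux);
# "source .venv/bin/activate" is the equivalent in bash and zsh
. .venv/bin/activate

# Keep the environment out of version control
echo ".venv/" >> .gitignore
```

Once the environment is active, pip installs packages into .venv instead of your global Python installation.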

Now, install Trafilatura:

pip install trafilatura
pip freeze > requirements.txt
git add .
git commit -m "first commit"

Confirm that the installation worked by running the following command:

trafilatura -u "https://boilercode.io/blog/trafilatura"

You should see the main text of this very blog post printed to your terminal, with the surrounding page boilerplate stripped away.

By default, Trafilatura outputs plain text, but it also supports several other formats. You can specify an output format using any of the following flags:

--csv, --html, --json, --markdown, --xml or --xmltei

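Try running the following command, which outputs Markdown and uses the --links and --images flags to keep link targets and image sources in the result: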

trafilatura -u "https://boilercode.io/blog/trafilatura" --markdown --links --images

What’s Next?

In the next post in this series, we’ll build a web server with a couple of Trafilatura-specific endpoints to scrape just about any page with a simple API call.