Phramer

This repository is under active development.

Phramer is an open-source library for extractive and abstractive text summarization.

Installation

  1. Clone the project:

    git clone git@github.com:phramer/phramer.git
    cd phramer
  2. Work inside a Docker container or a virtual environment, then install the dependencies:

    python -m venv .env
    . .env/bin/activate
    pip install -r requirements.txt

    If you would like to contribute to the project, please install dev dependencies as well:

    . .env/bin/activate
    pip install -r dev-requirements.txt
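
Before installing anything, it can help to confirm that the environment is actually active. The sketch below is one possible check, not part of the project: the function name `in_virtualenv` is hypothetical, and it relies only on the `VIRTUAL_ENV` variable that the `activate` script exports. It does not detect a Docker container.

```shell
# Return success when a virtual environment appears to be active.
# The activate script of venv/virtualenv exports VIRTUAL_ENV.
# Assumption: this does not detect Docker containers.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment detected; run '. .env/bin/activate' first"
fi
```

If the second message appears, activate the environment and rerun the `pip install` commands above.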

Getting data from DVC

To get the data and the reproduced models, pull them from our DVC storage:

dvc pull

To get a single file, pass its DVC-file as the pull target:

dvc pull data.dvc
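
To see which DVC-files could serve as pull targets, you can search the working tree for them. This is a minimal sketch using plain `find`, not a DVC feature: the helper name `list_dvc_targets` is hypothetical, and it assumes targets are files ending in `.dvc` that live outside DVC's internal `.dvc/` directory.

```shell
# List candidate pull targets: files named *.dvc under the given directory
# (default: current directory), skipping DVC's internal .dvc/ directory.
list_dvc_targets() {
    find "${1:-.}" -path '*/.dvc' -prune -o -type f -name '*.dvc' -print | sort
}

list_dvc_targets .
```

Any path it prints can be passed to `dvc pull` in the same way as `data.dvc` above.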

How to contribute

To continue our work, set up your own DVC remote storage for your data and models. It can be local storage on your machine or a cloud service (see the DVC docs).

To create remote storage, follow the instructions in the how-to section (Setting up a custom DVC project).

Then push your data to the storage by running the following command:

dvc push

Datasets

  1. Gigaword dataset (4M short articles in English, full description);
  2. CNN and Daily Mail datasets (300K long articles in English, full description);
  3. RIA News dataset (1M articles in Russian, full description).

Setting up a custom DVC project

The following steps describe how to create your own DVC project.

  1. Create the DVC project:

    dvc init
    # git commit -m "Initialize DVC project"
  2. Set up remote storage.

    Remote storage plays the same role for your data that a Git server plays for your code: it stores the files and lets you share them.

    • Local Storage
      dvc remote add -d localremote /tmp/dvc-storage
      # the -d flag makes this the default remote
    • Google Cloud Storage
      1. Create an account on Google Cloud and go to the Console.
      2. Create a new bucket (the default setup is fine).
      3. Go to APIs & Services > Credentials and create a new service account key.
      4. Choose the JSON key type, then create the key and download it to your machine.
      5. Run
        export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
        where [PATH] is the path to the JSON file (e.g. /home/user/Downloads/[FILE_NAME].json)
      6. Make sure you have the optional DVC dependencies for Google Cloud Storage:
        pip install "dvc[gs]"
      7. Add the Google Cloud bucket as your DVC remote storage:
        dvc remote add -d gs gs://yourbucket
        # you can find your bucket URL in *Bucket details > Overview*
        # by clicking on your bucket
  3. Part of our project uses data from the RIA News dataset repository. To keep the data up to date, we import it into the project with the import-url command:

    dvc import-url https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
  4. To add other data to the DVC project, run the following command:

    dvc add data/data.xml
    # git add data/.gitignore data/data.xml.dvc
    # git commit -m "Add raw data to project"
  5. To track data produced by running scripts, use the following command:

    dvc run -f script.dvc \
            -d script.py -d input_data \
            -o output_data \
            python script.py input_data
  6. When you have finished managing your data and models, push them to your remote storage:

    dvc push
  7. You can then retrieve your data with the pull command:

    dvc pull
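
If you use the Google Cloud Storage remote from step 2, a failed push or pull is often caused by a missing or incorrect credentials variable. The sketch below is one possible pre-flight check, not part of DVC or this project. The function name `check_gcs_credentials` is hypothetical. It assumes the downloaded key is a service-account JSON file, which carries a `"type": "service_account"` field.

```shell
# Check that GOOGLE_APPLICATION_CREDENTIALS points at a readable
# service-account key before DVC tries to reach the bucket.
check_gcs_credentials() {
    key="${GOOGLE_APPLICATION_CREDENTIALS:-}"
    if [ -z "$key" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "$key" ]; then
        echo "key file is missing or unreadable: $key" >&2
        return 1
    fi
    # Assumption: service-account key files contain "type": "service_account".
    if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key"; then
        echo "not a service-account key: $key" >&2
        return 1
    fi
    echo "credentials look usable: $key"
}

check_gcs_credentials || echo "fix the credentials before running dvc push or dvc pull"
```

A typical use would be `check_gcs_credentials && dvc push`, so that the push only starts when the key file is in place.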
