This repository is under active development.
Phramer is an open-source library for extractive and abstractive text summarization.
- Clone the project:

      git clone git@github.com:phramer/phramer.git
      cd phramer

- Make sure you are working in a Docker container or a virtual environment, then install the dependencies:

      python -m venv .env
      . .env/bin/activate
      pip install -r requirements.txt

  If you would like to contribute to the project, install the dev dependencies as well:

      . .env/bin/activate
      pip install -r dev-requirements.txt
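Before installing dependencies, it can help to confirm that the interpreter really is running inside the virtual environment. The following is a minimal sketch, not part of Phramer: the helper name `in_virtualenv` is hypothetical, and the check only covers `venv`/`virtualenv` environments, not Docker containers.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; activate .env first.")
```

Running the file with the interpreter that `pip` will use shows which environment the packages would be installed into.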
To get the data and the reproduced models, pull them from our DVC storage:

    dvc pull

To get one specific file, pull its target DVC file:

    dvc pull data.dvc

To continue our work, set up your own DVC remote storage for your data and models. It can be local storage on your machine or a cloud service (see the DVC docs).
See the how-to section below for instructions on creating remote storage.
Then push your data to the storage by running the following command:

    dvc push

Datasets:

- Gigaword dataset (4M short articles in English, full description);
- CNN and Daily Mail datasets (300K long articles in English, full description);
- RIA News dataset (1M articles in Russian, full description).
Here are instructions for creating your own DVC project.
- Create the DVC project:

      dvc init
      # git commit -m "Initialize DVC project"

- Set up remote storage for your data and models, similar to the way you use a Git server to store and share your code.
  - Local Storage

        dvc remote add -d localremote /tmp/dvc-storage
        # flag -d makes it the default storage

  - Google Cloud Storage
    - Create an account on Google Cloud and go to the Console.
    - Create a new bucket (the default setup will be okay).
    - Go to APIs & Services > Credentials and make a new service account key.
    - Choose the JSON key type, create the key, and download it to your machine.
    - Run

          export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

      where [PATH] is the path to the JSON file (e.g. /home/user/Downloads/[FILE_NAME].json).
    - Make sure you have the optional DVC dependencies for Google Cloud Storage:

          pip install "dvc[gs]"

    - Add the Google Cloud bucket as your remote storage in DVC:

          dvc remote add -d gs gs://yourbucket
          # you can find your bucket URL in *Bucket details > Overview*
          # by clicking on your bucket
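A wrong path in `GOOGLE_APPLICATION_CREDENTIALS` is an easy mistake to make in the Google Cloud Storage steps above. The sketch below checks the variable before you add the remote. The function name `check_gcs_credentials` and its messages are hypothetical, and the only property of the key file it relies on is the `"type": "service_account"` field that downloaded service account keys contain.

```python
import json
import os
from pathlib import Path


def check_gcs_credentials(env=None) -> str:
    """Return "ok" or a short description of what is wrong with the key setup."""
    env = os.environ if env is None else env
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    path = Path(raw)
    if not path.is_file():
        return f"key file not found: {path}"
    try:
        key = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return f"key file is not valid JSON: {path}"
    # Service account keys carry a "type" field set to "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"key file does not look like a service account key: {path}"
    return "ok"


if __name__ == "__main__":
    print(check_gcs_credentials())
```

The check does not contact Google Cloud, so it confirms only that the file exists and has the expected shape, not that the key grants access to the bucket.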
- Part of our project uses data from this repo. To keep the data up to date, we import it into our project with the import-url command:

      dvc import-url https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
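Once the import has finished, you may want to look at a few records without unpacking the whole archive. The sketch below assumes the file holds one JSON object per line, which is an assumption about the file layout rather than something stated above. The helper names `iter_records` and `peek` are hypothetical.

```python
import gzip
import json
from itertools import islice


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    # "rt" opens the archive in text mode, so lines are decoded on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path, n=3):
    """Return the first n records without reading the rest of the file."""
    return list(islice(iter_records(path), n))
```

Calling `peek("ria.json.gz")` from the directory where `dvc import-url` placed the file returns the first three records as dictionaries, so you can inspect their keys before writing a full loader.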
- To add other data to the DVC project, run the following commands:

      dvc add data/data.xml
      # git add data/.gitignore data/data.xml.dvc
      # git commit -m "Add raw data to project"
- To track the data produced by running scripts, use the following command:

      dvc run -f script.dvc \
              -d script.py -d input_data \
              -o output_data \
              python script.py input_data

- When you have finished managing your data and models, push them to your remote storage:

      dvc push
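The `dvc run` command above declares `script.py` and `input_data` as dependencies and `output_data` as the output. As an illustration of a script that fits that shape, here is a minimal sketch: the `transform` step is a placeholder, not Phramer's actual processing, and `run` is a hypothetical helper name.

```python
import sys
from pathlib import Path


def transform(text: str) -> str:
    """Placeholder step: strip trailing whitespace and drop blank lines."""
    lines = (line.rstrip() for line in text.splitlines())
    return "\n".join(line for line in lines if line) + "\n"


def run(source: Path, target: Path) -> None:
    """Read the tracked input and write the tracked output."""
    target.write_text(
        transform(source.read_text(encoding="utf-8")), encoding="utf-8"
    )


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # The output name matches the -o output_data declared to DVC.
        run(Path(sys.argv[1]), Path("output_data"))
    else:
        print("usage: python script.py input_data", file=sys.stderr)
```

Because the script reads only its declared dependency and writes only its declared output, DVC can tell from the recorded checksums whether the stage needs to be re-run.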
- Then you can retrieve your data with the pull command:

      dvc pull