From 46390bde4927485bcbea9f0ea45f621c79e63060 Mon Sep 17 00:00:00 2001 From: Jay Prakash Date: Wed, 1 Jan 2025 13:59:51 +0530 Subject: [PATCH 1/4] Add files via upload --- README (6).md | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 134 insertions(+) create mode 100644 README (6).md diff --git a/README (6).md b/README (6).md new file mode 100644 index 0000000..ea25f9a --- /dev/null +++ b/README (6).md @@ -0,0 +1,134 @@ +# Multilingual Neural Machine Translation System for TV News + +_This is my [Google summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) Project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._ + +The aim of this project is to build a Multilingual Neural Machine Translation System, which would be capable of translating Red Hen Lab's TV News Transcripts from different source languages to English. + +The system uses Reinforcement Learning (Advantage-Actor-Critic algorithm) on the top of neural encoder-decoder architecture and outperforms the results obtained by simple Neural Machine Translation which is based upon maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets. + +This project is inspired by the approaches mentioned in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086). + +I have made a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far. +Blog link: https://vikrant97.github.io/gsoc_blog/ + +The following languages are supported as the source language & the below are their language codes: +1) **German - de** +2) **French - fr** +3) **Russian - ru** +4) **Czech - cs** +5) **Spanish - es** +6) **Portuguese - pt** +7) **Danish - da** +8) **Swedish - sv** +9) **Chinese - zh** +The target language is English (en). + +## Getting Started + +### Prerequisites + +* Python-2.7 +* Pytorch-0.3 +* Tensorflow-gpu +* Numpy +* CUDA + +### Installation & Setup Instructions on CASE HPC + +* Users who want the pipeline to work on case HPC, just copy the directory named **nmt** from the home directory of my HPC account, i.e., **/home/vxg195**, and then follow the instructions described for training & translation. + +* The **nmt** directory will contain the following subdirectories: + * singularity + * data + * models + * Neural-Machine-Translation + * myenv + +* The **singularity** directory contains a singularity image (rh_xenial_20180308.img), which is copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This singularity image contains some modules like CUDA and CUDANN needed for the system. + +* The **data** directory consists of cleaned & processed datasets of respective language pairs. The subdirectories of this directory should be named like **de-en**, where **de** & **en** are the language codes for **German** & **English**. 
So, for any general language pair whose source language is **$src** and the target language is **$tgt**, the language data subdirectory should be named as **$src-$tgt**, and it should contain the following files (train, validation & test): + * train.$src-$tgt.$src.processed + * train.$src-$tgt.$tgt.processed + * valid.$src-$tgt.$src.processed + * valid.$src-$tgt.$tgt.processed + * test.$src-$tgt.$src.processed + * test.$src-$tgt.$tgt.processed + +* The **models** directory consists of trained models for the respective language pairs and also follows the same structure of subdirectories as the **data** directory. For example, **models/de-en** will contain trained models for the **German-English** language pair. + +* The following commands were used to install dependencies for the project: + ```bash + $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git + $ virtualenv myenv + $ source myenv/bin/activate + ``` + +* The virtual environment activation command for Windows is as follows: + ```bash + $ myenv\Scripts\activate + ``` + + After activating the virtual environment, run the following: + ```bash + $ pip install -r Neural-Machine-Translation/requirements.txt + ``` + +* **Note** that the virtual environment (myenv) created using the virtualenv command mentioned above should be of **Python2**. + +## Data Preparation and Preprocessing + +Please note that these data preparation steps have to be done manually as we are dealing with a Multilingual system, and each language pair might have different sources of data. For instance, I used many different data sources like europarl, newscommentary, commoncrawl & other open-source datasets. One can have a look at the shared task on Machine Translation, i.e., WMT, to get better datasets. I wrote a bash script that can be used to process & prepare datasets for MT. The following steps can be used to prepare a dataset for MT: +1) First copy the raw dataset files in the language ($src-$tgt) subdirectory of the data directory in the following format: + * train.$src-$tgt.$src + * train.$src-$tgt.$tgt + * valid.$src-$tgt.$src + * valid.$src-$tgt.$tgt + * test.$src-$tgt.$src + * test.$src-$tgt.$tgt + +2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset. Then use the following commands to process the dataset for training: + ```bash + bash prepare_data.sh $src $tgt + ``` + After this process, clear the entire language directory & just keep *.processed files. Your processed dataset is ready!! + +## Training + +To train a model on CASE HPC, one needs to run the train.sh file placed in the Neural-Machine-Translation/scripts folder. The parameters for training are kept such that a model can be efficiently trained for any newly introduced language pair, but one needs to tune the parameters according to the dataset. The prerequisite for training a model is that the parallel data as described in the **Installation** section should be residing in the concerned language pair directory in the data folder. The trained models will be saved in the language pair directory in the models folder. 
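+
+As a quick sanity check before submitting a training job, you can list the processed files for the language pair. This is a hypothetical example for German-English; the ~/nmt path is an assumption based on the setup instructions above, not part of the project's scripts:
+  ```bash
+  # The six processed files described in the Installation section should be present.
+  ls ~/nmt/data/de-en/
+  # train.de-en.de.processed  train.de-en.en.processed
+  # valid.de-en.de.processed  valid.de-en.en.processed
+  # test.de-en.de.processed   test.de-en.en.processed
+  ```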
To train a model on CASE HPC, run the following command: + + ```bash + cd Neural-Machine-Translation/scripts + sbatch train.sh + # For example, to train a model for German->English one should type the following command + sbatch train.sh de en + ``` +After training, the trained model will be saved in the language ($src-$tgt) subdirectory in the models directory. The saved model would be something like "model_15.pt," and it should be renamed to "model_15_best.pt." + +## Translation +This project supports the translation of both normal text files and news transcripts in any supported language pair. +To translate any input news transcript, run the following commands: + ```bash + cd Neural-Machine-Translation/scripts + sbatch translate.sh 0 + ``` +To translate any normal text file, run the following commands: + ```bash + cd Neural-Machine-Translation/scripts + sbatch translate.sh 1 + ``` +**Note that the output translated file will be saved in the same directory containing the input file with a ".pred" string appended to the name of the input file.** + +## Evaluation of the trained model +For evaluation, generate a translation of any source test corpora. Now, we need to test its efficiency against the original target test corpus. For this, we use the multi-bleu.perl script residing in the scripts directory, which measures the corpus BLEU score. Usage instructions: +```bash +perl scripts/multi-bleu.perl $reference-file < $hypothesis-file +``` + +## Acknowledgements + +* [Google Summer of Code 2018](https://summerofcode.withgoogle.com/) +* [Red Hen Lab](http://www.redhenlab.org/) +* [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) +* [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086) +* [Europarl](http://www.statmt.org/europarl/) +* [Moses](https://github.com/moses-smt/mosesdecoder) From 075135c456cca20b52ffb285085b571aaf75574f Mon Sep 17 00:00:00 2001 From: Jay Prakash Date: Wed, 1 Jan 2025 14:02:24 +0530 Subject: [PATCH 2/4] Delete README.md --- README.md | 124 ------------------------------------------------------ 1 file changed, 124 deletions(-) delete mode 100644 README.md diff --git a/README.md b/README.md deleted file mode 100644 index 85153ec..0000000 --- a/README.md +++ /dev/null @@ -1,124 +0,0 @@ -# Multilingual Neural Machine Translation System for TV News - -_This is my [Google summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) Project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._ - -The aim of this project is to build a Multilingual Neural Machine Translation System, which would be capable of translating Red Hen Lab's TV News Transcripts from different source languages to English. - -The system uses Reinforcement Learning(Advantage-Actor-Critic algorithm) on the top of neural encoder-decoder architecture and outperforms the results obtained by simple Neural Machine Translation which is based upon maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT(Workshop on Machine Translation) test datasets. - -This project is inspired by the approaches mentioned in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086). - -I have made a GSoC blog, please refer to it for my all GSoC blogposts about the progress made so far. 
-Blog link: https://vikrant97.github.io/gsoc_blog/ - -The following languages are supported as the source language & the below are their language codes: -1) **German - de** -2) **French - fr** -3) **Russian - ru** -4) **Czech - cs** -5) **Spanish - es** -6) **Portuguese - pt** -7) **Danish - da** -8) **Swedish - sv** -9) **Chinese - zh** -The target language is English(en). - -## Getting Started - -### Prerequisites - -* Python-2.7 -* Pytorch-0.3 -* Tensorflow-gpu -* Numpy -* CUDA - -### Installation & Setup Instructions on CASE HPC - -* Users who want the pipeline to work on case HPC, just copy the directory named **nmt** from the home directory of my hpc acoount i.e **/home/vxg195** & then follow the instructions described for training & translation. - -* nmt directory will contain the following subdirectories: - * singularity - * data - * models - * Neural-Machine-Translation - * myenv - -* The **singularity** directory contains a singularity image(rh_xenial_20180308.img) which is copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This singularity image contains some modules like CUDA and CUDANN needed for the system. - -* The **data** directory consists of cleaned & processed datasets of respective language pairs. The subdirectories of this directory should be named like **de-en** where **de** & **en** are the language codes for **German** & **English**. So for any general language pair whose source language is **$src** and the target language is **$tgt**, the language data subdirectory should be named as **$src-$tgt** and it should contain the following files(train, validation & test): - * train.$src-$tgt.$src.processed - * train.$src-$tgt.$tgt.processed - * valid.$src-$tgt.$src.processed - * valid.$src-$tgt.$tgt.processed - * test.$src-$tgt.$src.processed - * test.$src-$tgt.$tgt.processed - -* The **models** directory consists of trained models for the respective language pairs and also follows the same structure of subdirectories as **data** directory. For example, **models/de-en** will contains trained models for the **German-English** language pair. - -* The following commands were used to install dependencies for the project: - ```bash - $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git - $ virtualenv myenv - $ source myenv/bin/activate - $ pip install -r Neural-Machine-Translation/requirements.txt - ``` -* **Note** that the virtual environment(myenv) created using virtualenv command mentioned above, should be of **Python2** . - -## Data Preparation and Preprocessing - -Please note that these data preparation steps have to be done manually as we are dealing with a Multilingual system and each language pair might have different sources of data. For instance, I used many different data sources like europarl, newscommentary, commoncrawl & other opern source datasets. One can have a look at shared task on Machine Translation i.e. WMT, to get better datasets. I wrote a bash script which can be used to process & prepare dataset for MT. The following steps can be used to prepare dataset for MT: -1) First copy the raw dataset files in the language($src-$tgt) subdirectory of the data directory in the following format: - * train.$src-$tgt.$src - * train.$src-$tgt.$tgt - * valid.$src-$tgt.$src - * valid.$src-$tgt.$tgt - * test.$src-$tgt.$src - * test.$src-$tgt.$tgt - -2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. 
Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset. Then use the following commands to process the dataset for training: - ```bash - bash prepare_data.sh $src $tgt - ``` - After this process, clear the entire language directory & just keep \*.processed files. Your processed dataset is ready!! - -## Training - -To train a model on CASE HPC one needs to run the train.sh file placed in Neural-Machine-translation/scripts folder. The parameters for training are kept such that a model can be efficiently trained for any newly introduced language pair, but one needs to tune the parameters according to the dataset. The prerequisite for training a model is that the parallel data as described in **Installation** section should be residing in the concerned language pair directory in the data folder. The trained models will be saved in the language pair directory in the models folder. To train a model on CASE HPC, run the following command: - - ```bash - cd Neural-Machine-Translation/scripts - sbatch train.sh - # For example to train a model for German->English one should type the following command - sbatch train.sh de en - ``` -After training, the trained model will be saved in language($src-$tgt) subdirectory in the models directory. The saved model would be something like "model_15.pt" and it should be renamed to "model_15_best.pt". - -## Translation -This project supports translation of both normal text file and news transcripts in any supported language pair. -To translate any input news transcript, run the following commands: - ```bash - cd Neural-Machine-Translation/scripts - sbatch translate.sh 0 - ``` -To translate any normal text file, run the following commands: - ```bash - cd Neural-Machine-Translation/scripts - sbatch translate.sh 1 - ``` -**Note that the output translated file will be saved in the same directory containing the input file and with a ".pred" string appended to the name of the input file.** - -## Evaluation of the trained model -For evaluation, generate translation of any source test corpora. Now, we need to test its efficiency against the original target test corpus. For this, we use multi-bleu.perl script residing in the scripts directory which measures the corpus BLEU score. 
Usage instructions: -```bash -perl scripts/multi-bleu.perl $reference-file < $hypothesis-file -``` - -## Acknowledgements - -* [Google Summer of Code 2018](https://summerofcode.withgoogle.com/) -* [Red Hen Lab](http://www.redhenlab.org/) -* [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) -* [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086) -* [Europarl](http://www.statmt.org/europarl/) -* [Moses](https://github.com/moses-smt/mosesdecoder) From 616c7c81bf551ccdb8d69e25b99e8a5c07784f28 Mon Sep 17 00:00:00 2001 From: Jay Prakash Date: Wed, 1 Jan 2025 14:02:48 +0530 Subject: [PATCH 3/4] Delete README (6).md --- README (6).md | 134 -------------------------------------------------- 1 file changed, 134 deletions(-) delete mode 100644 README (6).md diff --git a/README (6).md b/README (6).md deleted file mode 100644 index ea25f9a..0000000 --- a/README (6).md +++ /dev/null @@ -1,134 +0,0 @@ -# Multilingual Neural Machine Translation System for TV News - -_This is my [Google summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) Project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._ - -The aim of this project is to build a Multilingual Neural Machine Translation System, which would be capable of translating Red Hen Lab's TV News Transcripts from different source languages to English. - -The system uses Reinforcement Learning (Advantage-Actor-Critic algorithm) on the top of neural encoder-decoder architecture and outperforms the results obtained by simple Neural Machine Translation which is based upon maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets. - -This project is inspired by the approaches mentioned in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086). - -I have made a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far. -Blog link: https://vikrant97.github.io/gsoc_blog/ - -The following languages are supported as the source language & the below are their language codes: -1) **German - de** -2) **French - fr** -3) **Russian - ru** -4) **Czech - cs** -5) **Spanish - es** -6) **Portuguese - pt** -7) **Danish - da** -8) **Swedish - sv** -9) **Chinese - zh** -The target language is English (en). - -## Getting Started - -### Prerequisites - -* Python-2.7 -* Pytorch-0.3 -* Tensorflow-gpu -* Numpy -* CUDA - -### Installation & Setup Instructions on CASE HPC - -* Users who want the pipeline to work on case HPC, just copy the directory named **nmt** from the home directory of my HPC account, i.e., **/home/vxg195**, and then follow the instructions described for training & translation. - -* The **nmt** directory will contain the following subdirectories: - * singularity - * data - * models - * Neural-Machine-Translation - * myenv - -* The **singularity** directory contains a singularity image (rh_xenial_20180308.img), which is copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This singularity image contains some modules like CUDA and CUDANN needed for the system. - -* The **data** directory consists of cleaned & processed datasets of respective language pairs. The subdirectories of this directory should be named like **de-en**, where **de** & **en** are the language codes for **German** & **English**. 
So, for any general language pair whose source language is **$src** and the target language is **$tgt**, the language data subdirectory should be named as **$src-$tgt**, and it should contain the following files (train, validation & test): - * train.$src-$tgt.$src.processed - * train.$src-$tgt.$tgt.processed - * valid.$src-$tgt.$src.processed - * valid.$src-$tgt.$tgt.processed - * test.$src-$tgt.$src.processed - * test.$src-$tgt.$tgt.processed - -* The **models** directory consists of trained models for the respective language pairs and also follows the same structure of subdirectories as the **data** directory. For example, **models/de-en** will contain trained models for the **German-English** language pair. - -* The following commands were used to install dependencies for the project: - ```bash - $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git - $ virtualenv myenv - $ source myenv/bin/activate - ``` - -* The virtual environment activation command for Windows is as follows: - ```bash - $ myenv\Scripts\activate - ``` - - After activating the virtual environment, run the following: - ```bash - $ pip install -r Neural-Machine-Translation/requirements.txt - ``` - -* **Note** that the virtual environment (myenv) created using the virtualenv command mentioned above should be of **Python2**. - -## Data Preparation and Preprocessing - -Please note that these data preparation steps have to be done manually as we are dealing with a Multilingual system, and each language pair might have different sources of data. For instance, I used many different data sources like europarl, newscommentary, commoncrawl & other open-source datasets. One can have a look at the shared task on Machine Translation, i.e., WMT, to get better datasets. I wrote a bash script that can be used to process & prepare datasets for MT. The following steps can be used to prepare a dataset for MT: -1) First copy the raw dataset files in the language ($src-$tgt) subdirectory of the data directory in the following format: - * train.$src-$tgt.$src - * train.$src-$tgt.$tgt - * valid.$src-$tgt.$src - * valid.$src-$tgt.$tgt - * test.$src-$tgt.$src - * test.$src-$tgt.$tgt - -2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset. Then use the following commands to process the dataset for training: - ```bash - bash prepare_data.sh $src $tgt - ``` - After this process, clear the entire language directory & just keep *.processed files. Your processed dataset is ready!! - -## Training - -To train a model on CASE HPC, one needs to run the train.sh file placed in the Neural-Machine-Translation/scripts folder. The parameters for training are kept such that a model can be efficiently trained for any newly introduced language pair, but one needs to tune the parameters according to the dataset. The prerequisite for training a model is that the parallel data as described in the **Installation** section should be residing in the concerned language pair directory in the data folder. The trained models will be saved in the language pair directory in the models folder. 
To train a model on CASE HPC, run the following command: - - ```bash - cd Neural-Machine-Translation/scripts - sbatch train.sh - # For example, to train a model for German->English one should type the following command - sbatch train.sh de en - ``` -After training, the trained model will be saved in the language ($src-$tgt) subdirectory in the models directory. The saved model would be something like "model_15.pt," and it should be renamed to "model_15_best.pt." - -## Translation -This project supports the translation of both normal text files and news transcripts in any supported language pair. -To translate any input news transcript, run the following commands: - ```bash - cd Neural-Machine-Translation/scripts - sbatch translate.sh 0 - ``` -To translate any normal text file, run the following commands: - ```bash - cd Neural-Machine-Translation/scripts - sbatch translate.sh 1 - ``` -**Note that the output translated file will be saved in the same directory containing the input file with a ".pred" string appended to the name of the input file.** - -## Evaluation of the trained model -For evaluation, generate a translation of any source test corpora. Now, we need to test its efficiency against the original target test corpus. For this, we use the multi-bleu.perl script residing in the scripts directory, which measures the corpus BLEU score. Usage instructions: -```bash -perl scripts/multi-bleu.perl $reference-file < $hypothesis-file -``` - -## Acknowledgements - -* [Google Summer of Code 2018](https://summerofcode.withgoogle.com/) -* [Red Hen Lab](http://www.redhenlab.org/) -* [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) -* [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086) -* [Europarl](http://www.statmt.org/europarl/) -* [Moses](https://github.com/moses-smt/mosesdecoder) From a381f1f79b5354eca7769562e5b66ba2c5b88d89 Mon Sep 17 00:00:00 2001 From: Jay Prakash Date: Wed, 1 Jan 2025 14:05:51 +0530 Subject: [PATCH 4/4] Add files via upload --- README.md | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 134 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..ea25f9a --- /dev/null +++ b/README.md @@ -0,0 +1,134 @@ +# Multilingual Neural Machine Translation System for TV News + +_This is my [Google summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) Project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._ + +The aim of this project is to build a Multilingual Neural Machine Translation System, which would be capable of translating Red Hen Lab's TV News Transcripts from different source languages to English. + +The system uses Reinforcement Learning (Advantage-Actor-Critic algorithm) on the top of neural encoder-decoder architecture and outperforms the results obtained by simple Neural Machine Translation which is based upon maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets. + +This project is inspired by the approaches mentioned in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086). + +I have made a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far. 
+Blog link: https://vikrant97.github.io/gsoc_blog/
+
+The following languages are supported as source languages, with their language codes:
+1) **German - de**
+2) **French - fr**
+3) **Russian - ru**
+4) **Czech - cs**
+5) **Spanish - es**
+6) **Portuguese - pt**
+7) **Danish - da**
+8) **Swedish - sv**
+9) **Chinese - zh**
+
+The target language is English (en).
+
+## Getting Started
+
+### Prerequisites
+
+* Python-2.7
+* Pytorch-0.3
+* Tensorflow-gpu
+* Numpy
+* CUDA
+
+### Installation & Setup Instructions on CASE HPC
+
+* Users who want the pipeline to work on CASE HPC should simply copy the directory named **nmt** from the home directory of my HPC account (**/home/vxg195**) and then follow the instructions described below for training & translation.
+
+* The **nmt** directory contains the following subdirectories:
+  * singularity
+  * data
+  * models
+  * Neural-Machine-Translation
+  * myenv
+
+* The **singularity** directory contains a Singularity image (rh_xenial_20180308.img) copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This image provides modules the system needs, such as CUDA and cuDNN.
+
+* The **data** directory contains the cleaned & processed datasets for the respective language pairs. Its subdirectories should be named like **de-en**, where **de** & **en** are the language codes for **German** & **English**. So, for any language pair whose source language is **$src** and target language is **$tgt**, the data subdirectory should be named **$src-$tgt** and should contain the following files (train, validation & test):
+  * train.$src-$tgt.$src.processed
+  * train.$src-$tgt.$tgt.processed
+  * valid.$src-$tgt.$src.processed
+  * valid.$src-$tgt.$tgt.processed
+  * test.$src-$tgt.$src.processed
+  * test.$src-$tgt.$tgt.processed
+
+* The **models** directory contains the trained models for the respective language pairs and follows the same subdirectory structure as the **data** directory. For example, **models/de-en** contains the trained models for the **German-English** language pair.
+
+* The following commands were used to install the project's dependencies:
+  ```bash
+  $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git
+  $ virtualenv myenv
+  $ source myenv/bin/activate
+  ```
+
+* On Windows, the virtual environment is activated with:
+  ```bash
+  $ myenv\Scripts\activate
+  ```
+
+  After activating the virtual environment, run:
+  ```bash
+  $ pip install -r Neural-Machine-Translation/requirements.txt
+  ```
+
+* **Note** that the virtual environment (myenv) created with the virtualenv command above must be a **Python 2** environment.
+
+## Data Preparation and Preprocessing
+
+Please note that these data preparation steps have to be done manually, since this is a multilingual system and each language pair may have different data sources. For instance, I used several sources such as Europarl, News Commentary, CommonCrawl & other open datasets; the WMT (Workshop on Machine Translation) shared tasks are a good place to find better datasets. I wrote a bash script that processes & prepares a dataset for MT; a rough sketch of the kind of commands such a script runs is shown below, followed by the usage steps.
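+
+The sketch below is only an illustration of typical NMT preprocessing, assuming the Moses tokenizer and the BPE scripts in the subword_nmt directory; the actual prepare_data.sh may differ, and the mosesdecoder path, the merge count, and the intermediate file names are assumptions:
+```bash
+# Hypothetical preprocessing sketch (not the actual prepare_data.sh):
+# tokenize both sides with the Moses tokenizer, then learn and apply
+# a joint BPE model with the subword_nmt scripts.
+src=de; tgt=en
+for lang in $src $tgt; do
+  perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l $lang \
+    < train.$src-$tgt.$lang > train.$src-$tgt.$lang.tok
+done
+cat train.$src-$tgt.$src.tok train.$src-$tgt.$tgt.tok \
+  | python subword_nmt/learn_bpe.py -s 32000 > bpe.codes
+for lang in $src $tgt; do
+  python subword_nmt/apply_bpe.py -c bpe.codes \
+    < train.$src-$tgt.$lang.tok > train.$src-$tgt.$lang.processed
+  # The valid.* and test.* files are processed the same way with apply_bpe.
+done
+```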
+The following steps can be used to prepare a dataset for MT:
+1) First, copy the raw dataset files into the language ($src-$tgt) subdirectory of the data directory, named as follows:
+   * train.$src-$tgt.$src
+   * train.$src-$tgt.$tgt
+   * valid.$src-$tgt.$src
+   * valid.$src-$tgt.$tgt
+   * test.$src-$tgt.$src
+   * test.$src-$tgt.$tgt
+
+2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the "prepare_data.sh" script into the language subdirectory whose dataset you want to prepare, then process the dataset for training with:
+   ```bash
+   bash prepare_data.sh $src $tgt
+   ```
+   After this step, remove everything from the language directory except the `*.processed` files. The processed dataset is now ready.
+
+## Training
+
+To train a model on CASE HPC, run the train.sh script in the Neural-Machine-Translation/scripts folder. The default training parameters are chosen so that a model can be trained efficiently for any newly introduced language pair, but they should still be tuned to the dataset. The only prerequisite is that the parallel data described in the **Installation** section must already reside in the corresponding language pair directory in the data folder; the trained models will be saved in the matching language pair directory in the models folder. To start training, run:
+
+   ```bash
+   cd Neural-Machine-Translation/scripts
+   sbatch train.sh $src $tgt
+   # For example, to train a German->English model:
+   sbatch train.sh de en
+   ```
+After training, the trained model is saved in the language ($src-$tgt) subdirectory of the models directory. The saved model will have a name like "model_15.pt" and should be renamed to "model_15_best.pt".
+
+## Translation
+This project supports translation of both plain text files and news transcripts in any supported language pair.
+To translate a news transcript, run:
+   ```bash
+   cd Neural-Machine-Translation/scripts
+   sbatch translate.sh 0
+   ```
+To translate a plain text file, run:
+   ```bash
+   cd Neural-Machine-Translation/scripts
+   sbatch translate.sh 1
+   ```
+**Note that the translated output is saved in the same directory as the input file, with ".pred" appended to the input file's name.**
+
+## Evaluation of the trained model
+For evaluation, generate a translation of a source test corpus and score it against the reference target corpus. For this, use the multi-bleu.perl script in the scripts directory, which computes the corpus-level BLEU score:
+```bash
+perl scripts/multi-bleu.perl $reference-file < $hypothesis-file
+```
+
+## Acknowledgements
+
+* [Google Summer of Code 2018](https://summerofcode.withgoogle.com/)
+* [Red Hen Lab](http://www.redhenlab.org/)
+* [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)
+* [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086)
+* [Europarl](http://www.statmt.org/europarl/)
+* [Moses](https://github.com/moses-smt/mosesdecoder)