Órla Ní Loinsigh be6ebb2ec8 Initial commit 1 year ago

STÓR Toolchain

A rewrite of the TM-to-TMX and Doc-to-TMX STÓR (formerly NRS) toolchains.

Dependencies

  • Python 3.8+ (developed/tested with Python 3.8.10)
  • python3-venv (for other Python dependencies)
  • libreoffice (text extraction)
  • pdftotext (text extraction)
  • A C++ compiler (to build hunalign)
  • hunalign (sentence alignment; see below)
  • Access to certain input resources as required by specific components (see below)

If not already present, libreoffice is available in Debian/Ubuntu repositories through apt, i.e.:

apt install libreoffice

Setup

All the following blocks of instructions will assume you are starting from a specific base directory, which will be referred to as $BASE. Changes to working directory will be noted as needed.

While the dependencies above may have needed root/sudo to install, the remaining setup should be done as a regular user.

cd <some directory>
export BASE=`pwd`

Check out repositories

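Firstly, check out all repositories. The remaining steps expect the toolchain repository at $BASE/toolchain, so it is cloned into a directory of that name. From the base directory run:

cd $BASE
git clone https://github.com/danielvarga/hunalign
git clone https://opengogs.adaptcentre.ie/Oniloinsigh/stor-toolchain toolchain
git clone <location of toolchain-internal>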

Build hunalign

cd $BASE/hunalign/src/hunalign
make

Install python dependencies

cd $BASE/toolchain
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
cp res/langdetect_models/ga venv/lib/python3.8/site-packages/langdetect/profiles/

Copy internal resources

cd $BASE/toolchain
cp -r $BASE/toolchain-internal/res/abbreviations/ ./res/
cp -r $BASE/toolchain-internal/res/dictionaries/ ./res/
cp -r $BASE/toolchain-internal/test-res/integration/ ./test-res/

Export environment variables

export PYTHONPATH=$BASE/toolchain:$PYTHONPATH
export HUNALIGNPATH=$BASE/hunalign/src/hunalign/hunalign
export PDFTOTEXTPATH=$(which pdftotext)
export LIBREOFFICEPATH=$(which libreoffice)
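
Before running anything, it can help to confirm that the variables above are set and point at real executables. The following is a minimal sketch only; the function name find_missing_tools is hypothetical and is not part of the toolchain.

```python
import os


def find_missing_tools(env, names=("HUNALIGNPATH", "PDFTOTEXTPATH", "LIBREOFFICEPATH")):
    """Return the variables in `names` that are unset or do not point at an executable file."""
    missing = []
    for name in names:
        path = env.get(name, "")
        if not (path and os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = find_missing_tools(os.environ)
    if problems:
        print("Check these variables: " + ", ".join(problems))
    else:
        print("All tool paths look usable.")
```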

Testing

Note that a few of the text extraction tests are slow because of a forced sleep, which works around a LibreOffice limitation that apparently cannot be avoided. The rest should run relatively quickly.

To run the tests:

cd $BASE/toolchain
coverage run -m unittest discover -s test

Components

Each toolchain consists of a set of components run in sequence. Each individual component may also be run stand-alone. Not all components are run in any given run of a toolchain; different components may be chosen depending on input filetype, whether the data is parallel or monolingual, the condition of the data, etc.

Parsers

Parsers are used to extract text from aligned input types (i.e. translation memory files).

Parser Types

Three aligned file types are accepted.

  • SDLTM
  • TMX (version 1.4)
  • XLIFF (versions 1.2 and 2.0)

Usage

All parsers have the same usage.

python toolchain/parsers/{sdltm_parser|tmx_parser|xliff_parser}.py lang_src lang_tgt input_path output_path_src output_path_tgt

Where:

  • lang_src: ISO 639-1 code of source language; variants are accepted but not mandatory
  • lang_tgt: ISO 639-1 code of target language; variants are accepted but not mandatory
  • input_path: path to input file
  • output_path_src: path to output file of source language
  • output_path_tgt: path to output file of target language

Output

All parsers output two plaintext files, one for the source language and one for the target. Language variant information, if present in the input, will not be preserved.
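
To illustrate what this output looks like, here is a rough sketch of pulling source and target segments out of a TMX document with the standard library. It is not the project's tmx_parser; the function name extract_segments and the simplified sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_segments(tmx_text, lang_src, lang_tgt):
    """Return (source_lines, target_lines) for units that contain both languages."""
    # Compare on the bare language code, so "en-IE" is treated as "en".
    lang_src = lang_src.split("-")[0].lower()
    lang_tgt = lang_tgt.split("-")[0].lower()
    src_lines, tgt_lines = [], []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if lang_src in segments and lang_tgt in segments:
            src_lines.append(segments[lang_src])
            tgt_lines.append(segments[lang_tgt])
    return src_lines, tgt_lines


# Simplified sample; a real TMX header carries more attributes.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-IE"/>
  <body>
    <tu>
      <tuv xml:lang="en-IE"><seg>Good morning</seg></tuv>
      <tuv xml:lang="ga-IE"><seg>Maidin mhaith</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    print(extract_segments(SAMPLE_TMX, "en", "ga"))
```

Writing each returned list to its own file, one segment per line, gives the two plaintext outputs described above.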

Extractors

Extractors are used to extract text from unaligned input types (i.e. raw corpus documents). They each expect a single document at a time. The language of the text to be extracted need not be specified.

Extractor Types

There are three types of extractor.

  • PlainTextExtractor for plaintext files (in reality, this is just a file copy)
  • EditableTextExtractor for other editable files (.doc, .docx, .odt, .rtf)
  • PdfTextExtractor for PDFs (note that this currently only works with PDFs that contain a text layer)

Usage

Because the plain text extractor is a trivial file copy, it does not have its own main. Each of the other extractors has a system dependency whose location must be communicated to it.

python toolchain/extractors/editable_text_extractor.py libreoffice_path input_path output_path
python toolchain/extractors/pdf_text_extractor.py pdftotext_path input_path output_path

Where:

  • libreoffice_path: path to libreoffice
  • pdftotext_path: path to pdftotext
  • input_path: path to input file
  • output_path: path to output file

Output

All extractors output a single plaintext file. Note that no normalization, segmentation, sentence-splitting etc. have been done in this step; it is simply text extraction alone.

Unicode Normalizer

The Unicode normalizer may be used to perform Unicode normalization on plaintext files. It converts text in either NFC or NFD form to NFC. It also optionally supports configurable character substitution.

Usage

To perform basic normalization, the tool may be run from the command line.

python toolchain/normalizer/unicode_normalizer.py input_path output_path

Where:

  • input_path: input plaintext file
  • output_path: output plaintext file

To configure character substitution, it is easiest to create a UnicodeNormalizer object directly.

UnicodeNormalizer().normalize(input_path, output_path, custom_subtitutions)

Where:

  • custom_subtitutions is a list of tuples; in each tuple, the first element is the character to be replaced and the second element is its substitution

Character substitution is a simple character/string replacement; regex replacement is not supported. Note that two character substitutions are configured by default and cannot be overridden:

| Original | Substitution | Notes |
|----------|--------------|-------|
| \ufeff | [empty] | Byte Order Mark (BOM) |
| ı́ | í | i with acute accent formed from a dotless i and a combining accent, which the unicodedata library does not compose |

Output

The normalizer outputs a single plaintext file encoded as UTF-8 NFC.
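
The behaviour described in this section can be approximated in a few lines. This is a sketch under assumptions, not the UnicodeNormalizer implementation: the function name normalize_text is hypothetical, it works on strings rather than files, and it assumes the second default substitution is U+0131 followed by U+0301.

```python
import unicodedata

# The two substitutions the README lists as always applied.
DEFAULT_SUBSTITUTIONS = [
    ("\ufeff", ""),              # byte order mark is dropped
    ("\u0131\u0301", "\u00ed"),  # dotless i + combining acute becomes í
]


def normalize_text(text, custom_substitutions=()):
    """Compose the text to NFC, then apply plain string replacements in order."""
    text = unicodedata.normalize("NFC", text)
    for original, replacement in list(custom_substitutions) + DEFAULT_SUBSTITUTIONS:
        text = text.replace(original, replacement)
    return text


if __name__ == "__main__":
    # "e" + combining acute (NFD) is composed into a single character.
    print(normalize_text("Caf\u0065\u0301"))
```

Custom substitutions are passed as a list of (original, replacement) tuples, for example normalize_text(text, [("\u2019", "'")]) to replace typographic apostrophes.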

Language Detector

Language can be detected on a string or on an entire file. If detecting the language of an entire file, not every line is scanned, as the process is quite slow. Instead, a number of lines is read at the start of the file, and after that lines are sampled. This can be configured using the command-line arguments detailed below.

The language detector is a wrapper around langdetect, which is itself a Python port of Nakatani Shuyo's language-detection.

Usage

python toolchain/common/language_detector.py [--file] [--min_file_line_length MIN_FILE_LINE_LENGTH] [--min_initial_lines MIN_INITIAL_LINES] [--sampling_interval SAMPLING_INTERVAL] input_line

Where:

  • file: the input is a filename, i.e. the language is to be detected on the whole file; default false
  • min_file_line_length: if detecting in a file, only check the language of lines with this many characters or more; default 0
  • min_initial_lines: if detecting in a file, search this number of lines at the start of the file; default 50
  • sampling_interval: if detecting in a file, sample this number of lines throughout the file; default 100
  • input_line: the input string or filename

Output

The detected language will be output in the form of an ISO 639-1 two-letter code, without variants, to stdout.
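
The file-sampling idea can be sketched as follows. This is only one possible reading of the options: it assumes sampling_interval means "check every Nth line after the initial block", which may differ from what the tool actually does, and the function name select_line_indexes is hypothetical.

```python
def select_line_indexes(lines, min_file_line_length=0, min_initial_lines=50, sampling_interval=100):
    """Return the indexes of the lines that would be passed to language detection."""
    selected = []
    for index, line in enumerate(lines):
        # Lines shorter than the minimum length are never checked.
        if len(line.strip()) < min_file_line_length:
            continue
        # Check every line in the initial block, then every Nth line (assumed).
        if index < min_initial_lines or index % sampling_interval == 0:
            selected.append(index)
    return selected


if __name__ == "__main__":
    lines = ["some text"] * 300
    print(len(select_line_indexes(lines)), "of", len(lines), "lines would be checked")
```

Each selected line would then be passed to the detector, and the per-line results combined into a single answer for the file.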

Sentence Splitters

Sentence splitters are used to reconstruct sentence boundary information. They take in plain text files as input and create files with a single sentence on each line.

Types

There are two types of sentence splitter.

  • EditableSentenceSplitter for text that originated from editable files (.doc, .docx, .odt, .rtf, .txt)
  • PdfSentenceSplitter for text that originated from PDFs

The main distinction between them is that for editable files, line endings are assumed to also be sentence endings, whereas for PDFs lines may end anywhere in a sentence.

Abbreviation lists

Both splitters require lists of abbreviations in the source and target languages. Abbreviation lists must come in the form of TSV files. Each line in an abbreviations file will consist of three fields.

The first field is the abbreviation itself. This is case-sensitive, and should feature no terminal full stop.

The second field is the expansion of the abbreviation. It is present only as a convenience; the splitters do not rely on it.

The third field is an optional boolean field. A True value indicates that the abbreviation expects to be followed by another word. If absent, it defaults to False.

Example snippet from an English abbreviation list:

Aug\tAugust
Co\tCounty\tTrue
Dr\tDoctor\tTrue
etc\tet cetera
Ms\tMs\tTrue
SI\tStatutory Instrument
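
A file in this format is straightforward to load. The sketch below is illustrative rather than the toolchain's own loader; the function name parse_abbreviations is hypothetical, and it assumes the \t shown above stands for a real tab character in the file.

```python
def parse_abbreviations(lines):
    """Map each abbreviation to (expansion, expects_following_word)."""
    abbreviations = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        # The third field is optional and defaults to False when absent.
        expects_following = len(fields) > 2 and fields[2].strip() == "True"
        abbreviations[fields[0]] = (fields[1], expects_following)
    return abbreviations


if __name__ == "__main__":
    sample = ["Aug\tAugust\n", "Co\tCounty\tTrue\n", "etc\tet cetera\n"]
    for abbreviation, details in parse_abbreviations(sample).items():
        print(abbreviation, details)
```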

Usage

Because abbreviation lists are language-specific, a sentence splitter should only be run on files whose language is known.

python toolchain/splitters/{editable_sentence_splitter|pdf_sentence_splitter}.py abbreviations_path input_path output_path

Where:

  • abbreviations_path: path to abbreviations file appropriate to language of file
  • input_path: path to input file
  • output_path: path to output file

Output

Both splitters output a single plaintext file with a single sentence to a line. Fragments of sentences that do not obviously belong to any sentence (e.g. section headings) will ideally also be output on their own line.

Note that there is room for error here, as all determinations are ultimately based on assumptions. This is particularly the case for PDFs.
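
To show why abbreviation lists matter, here is a deliberately naive splitter. It is not either of the project's splitters: split_sentences is a hypothetical name, it treats any token ending in ".", "?" or "!" as a sentence end unless the token is a known abbreviation, and it ignores the "expects a following word" flag.

```python
def split_sentences(text, abbreviations):
    """Split text into sentences, not breaking after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(("?", "!")):
            ends_sentence = True
        elif token.endswith("."):
            # "Dr." is looked up as "Dr", matching the list format (no terminal full stop).
            ends_sentence = token[:-1] not in abbreviations
        else:
            ends_sentence = False
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text without end punctuation (e.g. a heading) gets its own line.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Murphy lives in Co. Cork. She works in Dublin.", {"Dr", "Co"}):
        print(sentence)
```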

Document Aligner

The document aligner examines lists of files whose languages have been identified and attempts to determine which files correspond to one another.

Usage

Document alignment should be performed on files whose language is known and whose lines have been sentence-split.

python toolchain/docalign/document_aligner.py [--refalignments REFALIGNMENTS] file_list_path_src file_list_path_tgt output_dir

Where:

  • file_list_path_src: path to file listing source documents
  • file_list_path_tgt: path to file listing target documents
  • output_dir: path to directory to write results and artefacts to
  • --refalignments: optional path to reference alignments file for evaluation

Output

Not all documents are guaranteed to be aligned. The aligner will return three lists:

  • alignments: a list of tuples/pairs of aligned documents
  • unmatched_src: a list of documents in the source language for which no match could be found
  • unmatched_tgt: a list of documents in the target language for which no match could be found

These three lists will also be written to file.

Sentence Aligner

The sentence aligner aligns a single pair of files at a sentence level. The aligner is a wrapper around hunalign.

The wrapper adds a timeout to the external hunalign call, and splits the output into two target files.

The hunalign project must be built and available somewhere as a binary in order to run this component.

Usage

The sentence aligner should be run only on a single pair of files that are known to correspond to one another, i.e. after document alignment.

python toolchain/sentalign/sentence_aligner.py [--subprocess_timeout SUBPROCESS_TIMEOUT] hunalign dictionary input_path_src input_path_tgt output_path_src output_path_tgt output_artefact_dirname

Where:

  • hunalign: path to hunalign binary
  • dictionary: path to dictionary file
  • input_path_src: path to source input text file
  • input_path_tgt: path to target input text file
  • output_path_src: path to source output text file
  • output_path_tgt: path to target output text file
  • output_artefact_dirname: path to output artefact dir
  • --subprocess_timeout: timeout limit in seconds for running hunalign subprocess

Output

Output will be in the form of two plaintext files, one each for source and target, where the lines correspond to one another. Note that lines are not guaranteed to be non-empty on either side.
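
The two jobs of the wrapper, running an external process under a timeout and splitting its output in two, can be sketched as below. This is an illustration rather than the project's code: the function names are hypothetical, and the splitting assumes tab-separated output with the source text in the first column and the target text in the second.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command and return its stdout, or None if it exceeds the timeout."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_seconds, check=True)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def split_aligned_output(aligned_text):
    """Split tab-separated aligner output into parallel source and target lists."""
    src_lines, tgt_lines = [], []
    for line in aligned_text.splitlines():
        fields = line.split("\t")
        # Either side may be empty when a sentence has no counterpart.
        src_lines.append(fields[0].strip())
        tgt_lines.append(fields[1].strip() if len(fields) > 1 else "")
    return src_lines, tgt_lines


if __name__ == "__main__":
    # A stand-in command is used here so the sketch runs without hunalign installed.
    output = run_with_timeout([sys.executable, "-c", "print('Hello\\tDia duit\\t0.9')"], 30)
    print(split_aligned_output(output))
```

In real use the command would be the hunalign binary with its dictionary and the two input files, probably with hunalign's -text option so that the output contains sentences rather than line indices.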

Monolingual Text Cleaner

The monolingual text cleaner may be used to remove unwanted lines from a plaintext file. The file should be in a single language throughout. This cleaner removes empty lines and lines that are not of the expected language.

Usage

python toolchain/cleaners/monolingual_cleaner.py [--langdetect_threshold LANGDETECT_THRESHOLD] [--rejected_line_delimiter REJECTED_LINE_DELIMITER] lang input_path output_path_retained output_path_rejected

Where:

  • langdetect_threshold: the minimum length in characters that a line must be in order for language detection to be performed; default 40
  • rejected_line_delimiter: a string used to delimit fields in the output report; default "@@@"
  • lang: ISO 639-1 code of expected language
  • input_path: input plaintext file
  • output_path_retained: output file of accepted lines
  • output_path_rejected: output file of rejected lines

Output

This cleaner outputs two files. One will consist of the lines that are found acceptable, and the other a form of structured report file detailing lines that were rejected and why.
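
The filtering logic can be sketched as follows. This is not the project's cleaner: clean_lines is a hypothetical name, the detector is passed in as a plain callable so the example runs without langdetect, and the report layout (line number, reason, text) is an assumption, since the README does not specify the fields.

```python
def clean_lines(lines, lang, detect, langdetect_threshold=40, delimiter="@@@"):
    """Return (retained_lines, rejection_report_lines)."""
    retained, rejected = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            rejected.append(delimiter.join([str(number), "empty", text]))
        # Short lines are kept without detection, as detection is unreliable on them.
        elif len(text) >= langdetect_threshold and detect(text) != lang:
            rejected.append(delimiter.join([str(number), "wrong language", text]))
        else:
            retained.append(text)
    return retained, rejected


if __name__ == "__main__":
    def toy_detect(text):
        # Stand-in for a real language detector.
        return "ga" if " agus " in text else "en"

    kept, report = clean_lines(["", "Short line"], "en", toy_detect)
    print(kept, report)
```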

Post-Alignment Text Cleaner

The post-alignment text cleaner may be used to remove unwanted pairs of lines from parallel files whose languages are known. The files should be segmented and aligned. This cleaner removes line pairs where either source or target line is empty, not of the expected language, or does not contain any alphanumeric characters.

Usage

python toolchain/cleaners/post_alignment_cleaner.py [--langdetect_threshold LANGDETECT_THRESHOLD] [--rejected_line_delimiter REJECTED_LINE_DELIMITER] lang_src lang_tgt input_path_src input_path_tgt output_path_src output_path_tgt output_path_rejected

Where:

  • langdetect_threshold: the minimum length in characters that a line must be in order for language detection to be performed; default 40
  • rejected_line_delimiter: a string used to delimit fields in the output report; default "@@@"
  • lang_src: ISO 639-1 code of expected language of source file
  • lang_tgt: ISO 639-1 code of expected language of target file
  • input_path_src: input plaintext file of source language
  • input_path_tgt: input plaintext file of target language
  • output_path_src: output file of accepted source lines
  • output_path_tgt: output file of accepted target lines
  • output_path_rejected: output file of rejected lines

Output

This cleaner will output three files. Two will consist of the lines that were found acceptable for each language, and will still be aligned. The third will consist of a form of structured report file detailing line pairs that were rejected and why.
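
The key point is that a pair is dropped as a whole, so the retained files stay aligned. The sketch below illustrates this under assumptions: the function names are hypothetical, the detector is an injected callable, and the report layout (reason, source, target) is invented for the example.

```python
def rejection_reason(text, lang, detect, langdetect_threshold=40):
    """Return why a line is unacceptable, or None if it passes every check."""
    text = text.strip()
    if not text:
        return "empty"
    if not any(character.isalnum() for character in text):
        return "no alphanumeric characters"
    if len(text) >= langdetect_threshold and detect(text) != lang:
        return "wrong language"
    return None


def clean_pairs(src_lines, tgt_lines, lang_src, lang_tgt, detect, delimiter="@@@"):
    """Return (kept_src, kept_tgt, rejected_report); kept lists remain aligned."""
    kept_src, kept_tgt, rejected = [], [], []
    for src, tgt in zip(src_lines, tgt_lines):
        reason = (rejection_reason(src, lang_src, detect)
                  or rejection_reason(tgt, lang_tgt, detect))
        if reason:
            # A problem on either side rejects both lines of the pair.
            rejected.append(delimiter.join([reason, src.strip(), tgt.strip()]))
        else:
            kept_src.append(src.strip())
            kept_tgt.append(tgt.strip())
    return kept_src, kept_tgt, rejected


if __name__ == "__main__":
    print(clean_pairs(["Hello", "---"], ["Dia duit", "Rud"], "en", "ga", lambda text: "en"))
```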

TMX Creator

The TMX creator creates a single TMX file from a pair of plaintext files. The input files are expected to have the same number of lines, and to be aligned. The language of each input file must be known.

The TMX creator requires a Jinja template. A basic one is provided as a project resource under res/tmx_templates/generic_tmx_template.xml.

Usage

Command-line usage of this tool demands a large number of mandatory arguments, in order to comply with the TMX specification.

python toolchain/writers/tmx_creator.py template_path input_path_src input_path_tgt output_path adminlang datatype o_tmf segtype srclang tgtlang

Where:

  • template_path: path to Jinja template file
  • input_path_src: path to input file for source language
  • input_path_tgt: path to input file for target language
  • output_path: path to write TMX file to
  • adminlang: ISO 639-1 code of administrative language
  • datatype: type of data contained
  • o_tmf: original translation memory format
  • segtype: segmentation type
  • srclang: ISO 639-1 code of source language
  • tgtlang: ISO 639-1 code of target language

Alternatively, it is possible to create a TmxCreator object and pass the mandatory arguments in as a dictionary. The additional_args dictionary may also be used to pass any other non-mandatory arguments that might be required by a custom template.

TmxCreator().create(template_path, input_path_src, input_path_tgt, output_path, additional_args)

Output

A single TMX file is generated. It should be compliant with TMX schema 1.4b. In addition to the mandatory attributes listed above, a creation date attribute is also generated in the file header.
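
For a feel of the resulting structure, the sketch below builds a minimal TMX document from two aligned lists. It deliberately swaps the Jinja template for the standard library's ElementTree so that it runs without extra dependencies, so it is not how TmxCreator works; the function name build_tmx and the creationtool values are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(src_lines, tgt_lines, srclang, tgtlang, adminlang, datatype, o_tmf, segtype):
    """Return a TMX document as a string, one translation unit per line pair."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    root = ET.Element("tmx", version="1.4")
    ET.SubElement(root, "header", {
        "creationtool": "example-sketch",   # placeholder value
        "creationtoolversion": "0",         # placeholder value
        "adminlang": adminlang,
        "datatype": datatype,
        "o-tmf": o_tmf,
        "segtype": segtype,
        "srclang": srclang,
        "creationdate": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
    })
    body = ET.SubElement(root, "body")
    for src, tgt in zip(src_lines, tgt_lines):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, src), (tgtlang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx(["Hello"], ["Dia duit"], "en", "ga", "en", "plaintext", "txt", "sentence"))
```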

Running end-to-end toolchains

The alternative to running components individually is to run them in a toolchain. There are two toolchains:

  • TM-to-TMX for processing file types that are already aligned and creating cleaned TMX files from them
  • Doc-to-TMX for creating TMX files from raw corpus documents

Usage

Although they work differently, both toolchains are called the same way.

cd $BASE/toolchain
python toolchain/{tm_to_tmx_processor|doc_to_tmx_processor}.py id input_dir artefact_dir output_dir config_path

Where:

  • id: LR identifier; this will be used to generate filenames
  • input_dir: path to input directory
  • artefact_dir: path to artefact directory
  • output_dir: path to output directory
  • config_path: path to config; a sample has been included as toolchains.cfg

The artefact_dir and output_dir need not exist already, but they must be in a location where the user has permission to create them.

Output

TMX file(s) will be created in the output directory specified.

The TM-to-TMX toolchain creates a single file for each TM file found.

The Doc-to-TMX toolchain produces a single combined file for all the input it processed. In addition, it may produce monolingual text files for any documents that were unmatched at the end of document alignment. Whether these are retained for the source language, the target language, or both is configurable.
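
When a toolchain needs to be launched from another script, the call above can be wrapped in a small helper. This is a convenience sketch only: the function names are hypothetical, and it assumes the script is run with the repository root as the working directory and the virtual environment active.

```python
import subprocess
import sys

# Script paths as given in the usage line above, relative to the repository root.
PROCESSORS = {
    "tm": "toolchain/tm_to_tmx_processor.py",
    "doc": "toolchain/doc_to_tmx_processor.py",
}


def build_toolchain_command(kind, lr_id, input_dir, artefact_dir, output_dir,
                            config_path="toolchains.cfg"):
    """Return the argument list for one toolchain run."""
    return [sys.executable, PROCESSORS[kind], lr_id,
            input_dir, artefact_dir, output_dir, config_path]


def run_toolchain(kind, lr_id, input_dir, artefact_dir, output_dir,
                  config_path="toolchains.cfg", cwd=None):
    """Run a toolchain and raise CalledProcessError if it exits with an error."""
    command = build_toolchain_command(kind, lr_id, input_dir, artefact_dir,
                                      output_dir, config_path)
    return subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    print(build_toolchain_command("doc", "lr-001", "input", "artefacts", "output"))
```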

Running the Flask app

A simple Flask app is available to handle toolchain requests. To run it for development, in addition to the above instructions, run:

cd $BASE/toolchain
export FLASK_APP=$BASE/toolchain/toolchain/toolchain_app
export FLASK_ENV=development
export FLASK_RUN_PORT=5001
python -m flask run

There are two endpoints, one for each toolchain. In a local setup, they will look something like this:

http://127.0.0.1:5001/tm
http://127.0.0.1:5001/doc

Again, the interface for both is the same. Request JSON:

{
  "id": <id>,
  "input": <input_dir>,
  "artefact": <artefact_dir>,
  "output": <output_dir>
}

Response JSON:

{
  "file_infos": [<information about each file produced>],
  "rejected": <no. of files rejected as unprocessable>,
  "success": <ran without errors>
}

A toolchain will return a small amount of information about each file it produced. As indicated above, this will be returned in the form of a list of records.

{
  "encoding": <character encoding>,
  "format": <file format>,
  "languages": [<two-letter ISO 639-1 language code>],
  "linguality_type": <bilingual|monolingual>,
  "multilinguality_type": <parallel|comparable|...>,
  "size": <size>,
  "size_unit": <size unit>
}

Sample response:

{
  "file_infos": [{
    "encoding": "utf8",
    "format": "tmx",
    "languages": ["en", "ga"],
    "linguality_type": "bilingual",
    "multilinguality_type": "parallel",
    "size": 35,
    "size_unit": "translation_units"
  }],
  "rejected": 0,
  "success": true
}
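
A client for these endpoints might look like the sketch below, which uses only the standard library. It assumes the endpoints accept a POST request with a JSON body, which the README does not state explicitly; the function names are hypothetical.

```python
import json
import urllib.request


def build_toolchain_request(base_url, endpoint, lr_id, input_dir, artefact_dir, output_dir):
    """Build (but do not send) a request for the 'tm' or 'doc' endpoint."""
    payload = {"id": lr_id, "input": input_dir, "artefact": artefact_dir, "output": output_dir}
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed; the README does not name the HTTP method
    )


def summarize_response(response):
    """Return (success, rejected, total translation units) from a decoded response."""
    units = sum(info["size"] for info in response["file_infos"]
                if info["size_unit"] == "translation_units")
    return response["success"], response["rejected"], units


if __name__ == "__main__":
    request = build_toolchain_request("http://127.0.0.1:5001", "doc",
                                      "lr-001", "/data/in", "/data/artefacts", "/data/out")
    print(request.get_method(), request.full_url)
```

With the development server running, the request can be sent with urllib.request.urlopen(request) and the reply decoded with json.load before passing it to summarize_response.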