A rewrite of the TM-to-TMX and Doc-to-TMX STÓR (formerly NRS) toolchains.
If not already present, libreoffice is available in Debian/Ubuntu repositories through apt, i.e.:
apt install libreoffice
All the following blocks of instructions assume you are starting from a specific base directory, which will be referred to as $BASE. Changes to working directory will be noted as needed.
While the dependencies above may have needed root/sudo to install, the remaining setup should be done as a regular user.
cd <some directory>
export BASE=`pwd`
Firstly, check out all repositories; from the base directory run:
cd $BASE
git clone https://github.com/danielvarga/hunalign
git clone https://opengogs.adaptcentre.ie/Oniloinsigh/stor-toolchain
git clone <location of toolchain-internal>
cd $BASE/hunalign/src/hunalign
make
cd $BASE/toolchain
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
cp res/langdetect_models/ga venv/lib/python3.8/site-packages/langdetect/profiles/
cd $BASE/toolchain
cp -r $BASE/toolchain-internal/res/abbreviations/ ./res/
cp -r $BASE/toolchain-internal/res/dictionaries/ ./res/
cp -r $BASE/toolchain-internal/test-res/integration/ ./test-res/
export PYTHONPATH=$BASE/toolchain:$PYTHONPATH
export HUNALIGNPATH=$BASE/hunalign/src/hunalign/hunalign
export PDFTOTEXTPATH=$(which pdftotext)
export LIBREOFFICEPATH=$(which libreoffice)
Note that a few of the tests around text extraction are a bit slow due to a forced sleep (in turn, this is due to a limitation of LibreOffice that apparently cannot be got around). However, the rest should run relatively quickly.
To run the tests:
cd $BASE/toolchain
coverage run -m unittest discover -s test
Each toolchain consists of a set of components run in sequence. Each individual component may also be run stand-alone. Not all components are run in any given run of a toolchain; different components may be chosen depending on input filetype, whether the data is parallel or monolingual, the condition of the data, etc.
Parsers are used to extract text from aligned input types (i.e. translation memory files).
Three aligned file types are accepted: SDLTM, TMX and XLIFF.
All parsers have the same usage.
python toolchain/parsers/{sdltm_parser|tmx_parser|xliff_parser}.py lang_src lang_tgt input_path output_path_src output_path_tgt
Where:
lang_src: ISO 639-1 code of source language; variants are accepted but not mandatory
lang_tgt: ISO 639-1 code of target language; variants are accepted but not mandatory
input_path: path to input file
output_path_src: path to output file of source language
output_path_tgt: path to output file of target language
All parsers output two plaintext files, one for the source language and one for the target. Language variant information, if present in the input, will not be preserved.
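As a rough illustration of what a parser does, the sketch below pulls source and target segments out of a small TMX string using only the Python standard library. It is not the project's tmx_parser: the function name extract_segments and the sample data are hypothetical, and it assumes that languages are recorded in xml:lang attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def base_language(code):
    """Drop any variant, e.g. 'en-GB' -> 'en'."""
    return code.split("-")[0].lower()


def extract_segments(tmx_text, lang_src, lang_tgt):
    """Return two parallel lists of segments, one per language."""
    lang_src, lang_tgt = base_language(lang_src), base_language(lang_tgt)
    src_lines, tgt_lines = [], []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                text = "".join(seg.itertext()).strip()
                segments[base_language(tuv.get(XML_LANG, ""))] = text
        # Keep only translation units that have both languages.
        if lang_src in segments and lang_tgt in segments:
            src_lines.append(segments[lang_src])
            tgt_lines.append(segments[lang_tgt])
    return src_lines, tgt_lines


SAMPLE_TMX = (
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en-GB"><seg>Hello.</seg></tuv>'
    '<tuv xml:lang="ga-IE"><seg>Dia duit.</seg></tuv></tu>'
    "</body></tmx>"
)
```

Calling extract_segments(SAMPLE_TMX, "en", "ga") on the sample yields one English line and one Irish line, with the variant suffixes discarded.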
Extractors are used to extract text from unaligned input types (i.e. raw corpus documents). They each expect a single document at a time. The language of the text to be extracted need not be specified.
There are three types of extractor.
PlainTextExtractor for plaintext files (in reality, this is just a file copy)
EditableTextExtractor for other editable files (.doc, .docx, .odt, .rtf)
PdfTextExtractor for PDFs (note that this currently only works with PDFs that contain a text layer)
Because the plain text extractor is a trivial file copy, it does not have its own main. Each of the other extractors has a system dependency whose location must be communicated to it.
python toolchain/extractors/editable_text_extractor.py libreoffice_path input_path output_path
python toolchain/extractors/pdf_text_extractor.py pdftotext_path input_path output_path
Where:
libreoffice_path: path to libreoffice
pdftotext_path: path to pdftotext
input_path: path to input file
output_path: path to output file
All extractors output a single plaintext file. Note that no normalization, segmentation, sentence-splitting etc. have been done in this step; it is simply text extraction alone.
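For orientation, the two system dependencies can also be driven directly from Python. The sketch below only builds the command lines. The LibreOffice options shown (--headless, --convert-to, --outdir) are standard options of that tool, but whether the project's extractors use exactly these invocations is an assumption, and the function names are hypothetical.

```python
import subprocess


def pdftotext_command(pdftotext_path, input_path, output_path):
    """Command line that writes the text layer of a PDF to output_path."""
    return [pdftotext_path, input_path, output_path]


def libreoffice_command(libreoffice_path, input_path, output_dir):
    """Command line that converts an editable document to plain text.

    LibreOffice names the result after the input file, with a .txt
    extension, inside output_dir.
    """
    return [
        libreoffice_path,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", output_dir,
        input_path,
    ]


def run(command, timeout=120):
    """Run a command, raising if it fails or exceeds the timeout."""
    subprocess.run(command, check=True, timeout=timeout)
```

With the environment variables from the setup steps exported, run(pdftotext_command(os.environ["PDFTOTEXTPATH"], "in.pdf", "out.txt")) would perform one extraction (after importing os).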
The unicode normalizer may be used to perform unicode normalization on plaintext files. It resolves both NFC and NFD schemes to NFC. It also optionally supports configurable character substitution.
To perform basic normalization, the tool may be run from the command line.
python toolchain/normalizer/unicode_normalizer.py input_path output_path
Where:
input_path: input plaintext file
output_path: output plaintext file
To configure character substitution, it is easiest to create a UnicodeNormalizer object directly.
UnicodeNormalizer().normalize(input_path, output_path, custom_subtitutions)
Where:
custom_subtitutions is a list of tuples; in each tuple, the first element is the character to be replaced and the second is its substitution.
Character substitution is a simple character/string replacement; regex replacement is not supported. Note that two character substitutions are configured by default and may not be overridden:
Original | Substitution | Notes |
---|---|---|
\ufeff | [empty] | Byte Order Mark (BOM) |
ı́ | í | NFD i with accent, uncaught by the unicodedata library as it combines the dotless i |
The normalizer outputs a single plaintext file encoded as UTF-8 NFC.
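The behaviour described above can be approximated on a single string with the standard unicodedata module. This is a minimal sketch and not the UnicodeNormalizer class itself: the function name normalize_text is hypothetical, and the order of operations (normalize to NFC first, then substitute) is an assumption.

```python
import unicodedata

# The two default substitutions from the table above.
DEFAULT_SUBSTITUTIONS = [
    ("\ufeff", ""),              # byte order mark is removed
    ("\u0131\u0301", "\u00ed"),  # dotless i + combining acute -> í
]


def normalize_text(text, custom_substitutions=()):
    """Normalize text to NFC, then apply plain string substitutions."""
    text = unicodedata.normalize("NFC", text)
    for original, substitution in [*DEFAULT_SUBSTITUTIONS, *custom_substitutions]:
        # Simple replacement only; no regex handling, as in the tool.
        text = text.replace(original, substitution)
    return text
```

For example, normalize_text("e\u0301") composes the NFD sequence into the single code point U+00E9, and passing [("\u2019", "'")] as custom_substitutions additionally replaces curly apostrophes with straight ones.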
Language can be detected on a string or on an entire file. If detecting the language of an entire file, not every line is scanned, as the process is quite slow. Instead, a number of lines is read at the start of the file, and after that lines are sampled. This can be configured using the command-line arguments detailed below.
The language detector is a wrapper around langdetect, which is itself a Python port of Nakatani Shuyo's language-detection.
python toolchain/common/language_detector.py [--file] [--min_file_line_length MIN_FILE_LINE_LENGTH] [--min_initial_lines MIN_INITIAL_LINES] [--sampling_interval SAMPLING_INTERVAL] input_line
Where:
file: the input is a filename, i.e. the language is to be detected on the whole file; default false
min_file_line_length: if detecting in a file, only check the language of lines with this many characters or more; default 0
min_initial_lines: if detecting in a file, search this number of lines at the start of the file; default 50
sampling_interval: if detecting in a file, sample this number of lines throughout the file; default 100
input_line: the input string or filename
The detected language will be output in the form of an ISO 639-1 two-letter code, without variants, to stdout.
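The file-sampling strategy can be pictured roughly as follows. This sketch only selects which lines would be passed to the detector; it does not call langdetect. It assumes that sampling_interval means "every Nth line after the initial block" and that the length filter is applied after selection; both are interpretations of the description above, and the function name is hypothetical.

```python
def select_lines_for_detection(
    lines,
    min_file_line_length=0,
    min_initial_lines=50,
    sampling_interval=100,
):
    """Pick the lines of a file whose language would be checked."""
    selected = []
    for index, raw_line in enumerate(lines):
        line = raw_line.rstrip("\n")
        in_initial_block = index < min_initial_lines
        # Assumption: after the initial block, take every Nth line.
        on_sampling_point = index % sampling_interval == 0
        if not (in_initial_block or on_sampling_point):
            continue
        if len(line) >= min_file_line_length:
            selected.append(line)
    return selected
```

With the defaults, a 1,000-line file would contribute its first 50 lines plus lines 100, 200, and so on (zero-based), so only a small fraction of the file reaches the comparatively slow detector.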
Sentence splitters are used to reconstruct sentence boundary information. They take in plain text files as input and create files with a single sentence on each line.
There are two types of sentence splitter.
EditableSentenceSplitter for text that originated from editable files (.doc, .docx, .odt, .rtf, .txt)
PdfSentenceSplitter for text that originated from PDFs
The main distinction between them is that for editable files, line endings are assumed to also be sentence endings, whereas for PDFs lines may end anywhere in a sentence.
Both splitters require lists of abbreviations in the source and target languages. Abbreviation lists must come in the form of TSV files. Each line in an abbreviations file will consist of three fields.
The first field is the abbreviation itself. This is case-sensitive, and should feature no terminal full stop.
The second is the expansion of the abbreviation. This is present only as a convenience, and is not strictly used by the splitters.
The third field is an optional boolean field. A True value indicates that the abbreviation expects to be followed by another word. If absent, it defaults to False.
Example snippet from an English abbreviation list:
Aug\tAugust
Co\tCounty\tTrue
Dr\tDoctor\tTrue
etc\tet cetera
Ms\tMs\tTrue
SI\tStatutory Instrument
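A file in this format is straightforward to read. The sketch below shows one possible way to load it into a dictionary; the names Abbreviation and parse_abbreviations are hypothetical and the project's own loader may differ.

```python
from typing import NamedTuple


class Abbreviation(NamedTuple):
    expansion: str
    expects_following_word: bool


def parse_abbreviations(lines):
    """Map each abbreviation (case-sensitive) to its other two fields."""
    abbreviations = {}
    for raw_line in lines:
        line = raw_line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        expansion = fields[1] if len(fields) > 1 else ""
        # The third field is optional and defaults to False.
        expects = len(fields) > 2 and fields[2].strip().lower() == "true"
        abbreviations[fields[0]] = Abbreviation(expansion, expects)
    return abbreviations


SAMPLE = ["Aug\tAugust\n", "Co\tCounty\tTrue\n", "etc\tet cetera\n"]
```

Here parse_abbreviations(SAMPLE)["Co"] carries the expansion "County" and a True flag, while "etc" falls back to False because its third field is absent.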
Because abbreviation lists are language-specific, a sentence splitter should only be run on files whose language is known.
python toolchain/splitters/{editable_sentence_splitter|pdf_sentence_splitter}.py abbreviations_path input_path output_path
Where:
abbreviations_path: path to abbreviations file appropriate to language of file
input_path: path to input file
output_path: path to output file
Both splitters output a single plaintext file with a single sentence to a line. Fragments of sentences that do not obviously belong to any sentence (e.g. section headings) will ideally also be output on their own line.
Note that there is room for error here, as all determinations are ultimately based on assumptions. This is particularly the case for PDFs.
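To show why the abbreviation list matters, here is a deliberately naive splitter that breaks a line at terminal punctuation unless the token is a known abbreviation. It is far simpler than the project's splitters (it ignores the "followed by another word" flag, for instance), and the function name is hypothetical.

```python
def split_sentences(line, abbreviations):
    """Split one line into sentences, treating abbreviations as non-final."""
    sentences, current = [], []
    for token in line.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        # "Co." should not end a sentence if "Co" is a listed abbreviation.
        if ends_with_stop and token.rstrip(".!?") not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing fragment with no terminal punctuation, e.g. a heading.
        sentences.append(" ".join(current))
    return sentences
```

Given the abbreviations {"Dr", "Co", "Aug"}, the line "Dr Murphy lives in Co. Cork. She moved in Aug. 2020." is split after "Cork." and "2020." only, rather than after every full stop.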
The document aligner examines lists of files whose languages have been identified and attempts to determine which files correspond to one another.
Document alignment should be performed on files whose language is known and whose lines have been sentence-split.
python toolchain/docalign/document_aligner.py file_list_path_src file_list_path_tgt output_dir
Where:
file_list_path_src: path to file listing source documents
file_list_path_tgt: path to file listing target documents
output_dir: path to directory to write results and artefacts to
--refalignments: path to reference alignments file for evaluation
Not all documents are guaranteed to be aligned. The aligner will return three lists:
alignments: a list of tuples/pairs of aligned documents
unmatched_src: a list of documents in the source language for which no match could be found
unmatched_tgt: a list of documents in the target language for which no match could be found
These three lists will also be written to file.
The sentence aligner aligns a single pair of files at a sentence level. The aligner is a wrapper around hunalign.
The wrapper adds a timeout to the external hunalign call, and splits the output into two target files.
The hunalign project must be built and available somewhere as a binary in order to run this component.
The sentence aligner should be run only on a single pair of files that are known to correspond to one another, i.e. after document alignment.
python toolchain/sentalign/sentence_aligner.py [--subprocess_timeout] hunalign dictionary input_path_src input_path_tgt output_path_src output_path_tgt output_artefact_dirname
Where:
hunalign: path to hunalign binary
dictionary: path to dictionary file
input_path_src: path to source input text file
input_path_tgt: path to target input text file
output_path_src: path to source output text file
output_path_tgt: path to target output text file
output_artefact_dirname: path to output artefact dir
--subprocess_timeout: timeout limit in seconds for running hunalign subprocess
Output will be in the form of two plaintext files, one each for source and target, where the lines correspond to one another. Note that lines are not guaranteed to be non-empty on either side.
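The two jobs of the wrapper, enforcing a timeout and splitting hunalign's output, might look roughly like this. The sketch assumes hunalign is run with its -text option, which prints tab-separated source text, target text and a score per line; whether the project uses that option is an assumption, and both function names are hypothetical.

```python
import subprocess


def run_hunalign(hunalign, dictionary, input_path_src, input_path_tgt, timeout=600):
    """Run hunalign and return its text output.

    Raises subprocess.TimeoutExpired if the call exceeds the timeout.
    """
    completed = subprocess.run(
        [hunalign, "-text", dictionary, input_path_src, input_path_tgt],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout


def split_aligned_output(aligned_text):
    """Split 'source<TAB>target<TAB>score' lines into two parallel lists."""
    src_lines, tgt_lines = [], []
    for line in aligned_text.splitlines():
        fields = line.split("\t")
        # Either side may be empty when hunalign finds no counterpart.
        src_lines.append(fields[0] if len(fields) > 0 else "")
        tgt_lines.append(fields[1] if len(fields) > 1 else "")
    return src_lines, tgt_lines
```

Keeping the split as a separate function means it can be checked without the hunalign binary; an input line with an empty first field produces an empty source line, which is why downstream cleaning is needed.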
The monolingual text cleaner may be used to remove unwanted lines from a plaintext file. The file should be in a single language throughout. This cleaner removes empty lines and lines that are not of the expected language.
python toolchain/cleaners/monolingual_cleaner.py [--langdetect_threshold LANGDETECT_THRESHOLD] [--rejected_line_delimiter REJECTED_LINE_DELIMITER] lang input_path output_path_retained output_path_rejected
Where:
langdetect_threshold: the minimum length in characters that a line must be in order for language detection to be performed; default 40
rejected_line_delimiter: a string used to delimit fields in the output report; default "@@@"
lang: ISO 639-1 code of expected language
input_path: input plaintext file
output_path_retained: output file of accepted lines
output_path_rejected: output file of rejected lines
This cleaner outputs two files. One will consist of the lines that are found acceptable, and the other a form of structured report file detailing lines that were rejected and why.
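The filtering rules can be sketched as below. The language detector is passed in as a plain callable so the example runs without langdetect installed. The layout of the rejection report (line number, reason, text) and the reason labels are assumptions made for illustration, as is the function name.

```python
def clean_lines(lines, lang, detect, langdetect_threshold=40,
                rejected_line_delimiter="@@@"):
    """Split lines into retained lines and a rejection report."""
    retained, rejected = [], []
    for number, raw_line in enumerate(lines, start=1):
        line = raw_line.strip()
        if not line:
            reason = "empty"
        elif len(line) >= langdetect_threshold and detect(line) != lang:
            # Lines shorter than the threshold skip detection entirely.
            reason = "wrong_language"
        else:
            retained.append(line)
            continue
        rejected.append(rejected_line_delimiter.join([str(number), reason, line]))
    return retained, rejected


def toy_detector(text):
    """Stand-in for a real detector: 'ga' if the text contains 'agus'."""
    return "ga" if "agus" in text else "en"
```

Short lines are retained without a language check because detection on a handful of characters is unreliable; that is the purpose of langdetect_threshold.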
The post-alignment text cleaner may be used to remove unwanted pairs of lines from parallel files whose languages are known. The files should be segmented and aligned. This cleaner removes line pairs where either source or target line is empty, not of the expected language, or does not contain any alphanumeric characters.
python toolchain/cleaners/post_alignment_cleaner.py [--langdetect_threshold LANGDETECT_THRESHOLD] [--rejected_line_delimiter REJECTED_LINE_DELIMITER] lang_src lang_tgt input_path_src input_path_tgt output_path_src output_path_tgt output_path_rejected
Where:
langdetect_threshold: the minimum length in characters that a line must be in order for language detection to be performed; default 40
rejected_line_delimiter: a string used to delimit fields in the output report; default "@@@"
lang_src: ISO 639-1 code of expected language of source file
lang_tgt: ISO 639-1 code of expected language of target file
input_path_src: input plaintext file of source language
input_path_tgt: input plaintext file of target language
output_path_src: output file of accepted source lines
output_path_tgt: output file of accepted target lines
output_path_rejected: output file of rejected lines
This cleaner will output three files. Two will consist of the lines that were found acceptable for each language, and will still be aligned. The third will consist of a form of structured report file detailing line pairs that were rejected and why.
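The key property here is that a pair is kept or dropped as a unit, so the two outputs stay aligned. The sketch below covers only the empty-line and alphanumeric checks and leaves the language check out. The function names and the report layout are hypothetical.

```python
def has_alphanumeric(line):
    """True if the line contains at least one letter or digit."""
    return any(character.isalnum() for character in line)


def clean_pairs(src_lines, tgt_lines, rejected_line_delimiter="@@@"):
    """Drop unusable line pairs while keeping both sides aligned."""
    kept_src, kept_tgt, rejected = [], [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            reason = "empty"
        elif not (has_alphanumeric(src) and has_alphanumeric(tgt)):
            reason = "no_alphanumeric"
        else:
            kept_src.append(src)
            kept_tgt.append(tgt)
            continue
        # A pair rejected for either side is removed from both outputs.
        rejected.append(rejected_line_delimiter.join([src, tgt, reason]))
    return kept_src, kept_tgt, rejected
```

Because both lists are appended to in the same branch, line n of the cleaned source output always corresponds to line n of the cleaned target output.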
The TMX creator creates a single TMX file from a pair of plaintext files. The input files are expected to have the same number of lines, and to be aligned. The language of each input file must be known.
The TMX creator requires a Jinja template. A basic one is provided as a project resource under res/tmx_templates/generic_tmx_template.xml.
Command-line usage of this tool demands a large number of mandatory arguments, in order to comply with the TMX specification.
python toolchain/writers/tmx_creator.py template_path input_path_src input_path_tgt output_path adminlang datatype o_tmf segtype srclang tgtlang
Where:
template_path: path to Jinja template file
input_path_src: path to input file for source language
input_path_tgt: path to input file for target language
output_path: path to write TMX file to
adminlang: ISO 639-1 code of administrative language
datatype: type of data contained
o_tmf: original translation memory format
segtype: segmentation type
srclang: ISO 639-1 code of source language
tgtlang: ISO 639-1 code of target language
Alternatively, it is possible to create a TmxCreator object and pass the mandatory arguments in as a dictionary. The additional_args dictionary may also be used to pass any other non-mandatory arguments that might be required by a custom template.
TmxCreator().create(template_path, input_path_src, input_path_tgt, output_path, additional_args)
A single TMX file is generated. It should be compliant with TMX schema 1.4b. In addition to the mandatory attributes listed above, a creation date attribute is also generated in the file header.
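To make the shape of the output concrete, the sketch below builds a minimal TMX document from two aligned lists. It uses xml.etree.ElementTree rather than a Jinja template, so it is a stand-in for the real TmxCreator, not a copy of it. The function name, the default attribute values and the creationtool name are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(src_lines, tgt_lines, srclang, tgtlang, adminlang="en",
              datatype="plaintext", o_tmf="plaintext", segtype="sentence"):
    """Return a TMX document (as a string) with one <tu> per line pair."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("input files must have the same number of lines")
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # hypothetical tool name
        "creationtoolversion": "0.0",
        "segtype": segtype,
        "o-tmf": o_tmf,
        "adminlang": adminlang,
        "srclang": srclang,
        "datatype": datatype,
        "creationdate": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in zip(src_lines, tgt_lines):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, src), (tgtlang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

A template-based approach, as used by the project, keeps the XML layout editable without touching code; building the tree programmatically, as here, trades that flexibility for automatic escaping of special characters.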
The alternative to running components individually is to run them in a toolchain. There are two toolchains:
TM-to-TMX for processing file types that are already aligned and creating cleaned TMX files from them
Doc-to-TMX for creating TMX files from raw corpus documents
Although they work differently, both toolchains are called the same way.
cd $BASE/toolchain
python toolchain/{tm_to_tmx_processor|doc_to_tmx_processor}.py id input_dir artefact_dir output_dir config_path
Where:
id: LR identifier; this will be used to generate filenames
input_dir: path to input directory
artefact_dir: path to artefact directory
output_dir: path to output directory
config_path: path to config; a sample has been included as toolchains.cfg
The artefact_dir and output_dir need not exist already, but they must be in a location where the user has permission to create them.
TMX file(s) will be created in the output directory specified.
The TM-to-TMX toolchain creates a single file for each TM file found.
The Doc-to-TMX toolchain produces a single combined file for all the input it processed. In addition, it may produce monolingual text files for any documents that were unmatched at the end of document alignment. Whether these are retained for the source language, the target language, or both is configurable.
A simple Flask app is available to handle toolchain requests. To run this for development, in addition to the above instructions, run:
cd $BASE/toolchain
export FLASK_APP=$BASE/toolchain/toolchain/toolchain_app
export FLASK_ENV=development
export FLASK_RUN_PORT=5001
python -m flask run
There are two endpoints, one for each toolchain. In a local setup, they will look something like this:
http://127.0.0.1:5001/tm
http://127.0.0.1:5001/doc
Again, the interface for both is the same. Request json:
{
    'id': <id>,
    'input': <input_dir>,
    'artefact': <artefact_dir>,
    'output': <output_dir>
}
Response json:
{
    'file_infos': [<information about each file produced>],
    'rejected': <no. of files rejected as unprocessable>,
    'success': <ran without errors>
}
A toolchain will return a small amount of information about each file it produced. As indicated above, this will be returned in the form of a list of records.
{
    'encoding': <character encoding>,
    'format': <file format>,
    'languages': [<two-letter ISO 639-1 language code>],
    'linguality_type': <bilingual|monolingual>,
    'multilinguality_type': <parallel|comparable|...>,
    'size': <size>,
    'size_unit': <size unit>
}
Sample response:
{
    'file_infos': [{
        'encoding': 'utf8',
        'format': 'tmx',
        'languages': ['en', 'ga'],
        'linguality_type': 'bilingual',
        'multilinguality_type': 'parallel',
        'size': 35,
        'size_unit': 'translation_units'
    }],
    'rejected': 0,
    'success': True
}
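A request to either endpoint could be prepared from Python along these lines. The sketch assumes the endpoints accept an HTTP POST with a JSON body, which the description above does not state explicitly. The function name and the example paths are hypothetical.

```python
import json
from urllib import request


def build_toolchain_request(endpoint, lr_id, input_dir, artefact_dir, output_dir):
    """Prepare (but do not send) a JSON request for a toolchain endpoint."""
    payload = {
        "id": lr_id,
        "input": input_dir,
        "artefact": artefact_dir,
        "output": output_dir,
    }
    return request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the app expects POST
    )


# Sending it against a locally running development server:
#     req = build_toolchain_request("http://127.0.0.1:5001/tm", "lr-1",
#                                   "/data/in", "/data/artefacts", "/data/out")
#     with request.urlopen(req) as response:
#         result = json.load(response)
```

The decoded result would then contain the file_infos, rejected and success fields described above.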