Importing and Exporting Unisyn Information

Enrichment Files

This document assumes access to the unisyn_spade GitHub repository.

Two different enrichment files can be generated using the scripts check_rule_application.py and lex_enrich.py:

check_rule_application.py

This enrichment script generates a file called rule_applications.csv, which always contains the following columns:

  1. word: the word that rules are being applied to
  2. UnisynPrimStressedVowel1: this is the pre-rule symbol for the primary stressed vowel
  3. UnisynPrimStressedSyllable1: this is the collection of pre-rule symbols for the primary stressed syllable
  4. AnyRulesApply: true if any rules were applied to the word, false otherwise

After these 4 columns, there are columns named <rule>_applied, with boolean values indicating whether <rule> was applied to the word. This is across accents: if a rule is applied to any primary stressed vowel in any accent for a particular word, then that column will have a “True” value for that word. For example, the word “abjure” has the rule “do_scots_long” applied to it for the Edinburgh accent, and so in rule_applications.csv it has a “True” in that column.
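If you want a quick way to inspect rule_applications.csv, the sketch below may help. It assumes pandas is installed and the file is in the working directory; the do_scots_long_applied column name is simply inferred from the <rule>_applied pattern described above. It lists every word to which do_scots_long applied in some accent.

    import pandas as pd

    rules = pd.read_csv("rule_applications.csv")

    # Column names follow the <rule>_applied pattern described above. Casting to
    # string makes the filter work whether pandas reads the values as booleans
    # or as "True"/"False" strings.
    applied = rules["do_scots_long_applied"].astype(str) == "True"
    print(rules.loc[applied, "word"].tolist())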

lex_enrich.py

This script takes as arguments two accents that you would like to compare. It then generates a file called <accent>_comparison.csv for each accent, which contains:

  1. Word: the word rules are being applied to
  2. UnisynPrimStressedVowel2_<accent>: this is the global keysymbol for the primary stressed vowel
  3. UnisynPrimStressedVowel3_<accent>: this is the post-rule keysymbol for the primary stressed vowel
  4. UnisynPrimStressedVowel3_XSAMPA_<accent>: this is the SAMPA character that the post-rule symbol maps to
  5. AnyRuleApplied_<accent>: this is true if any rule applies for that accent. Note this is different from the AnyRulesApply column in rule_applications.csv, which is true if a rule applies in any accent; this column is only true if a rule applies in the accent in question

After these columns, there is one column for each rule that is ever applied in that particular accent.
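For example, to see where two accents end up with different post-rule vowels, you can merge the two per-accent files. The sketch below assumes pandas, and assumes “edi” and “sca” were the two accent codes passed to lex_enrich.py; substitute whatever accents you actually compared.

    import pandas as pd

    edi = pd.read_csv("edi_comparison.csv")
    sca = pd.read_csv("sca_comparison.csv")

    # Join the two files on the shared Word column and keep words whose
    # post-rule keysymbols differ between the two accents.
    merged = edi.merge(sca, on="Word")
    differs = merged["UnisynPrimStressedVowel3_edi"] != merged["UnisynPrimStressedVowel3_sca"]
    print(merged.loc[differs, ["Word",
                               "UnisynPrimStressedVowel3_edi",
                               "UnisynPrimStressedVowel3_sca"]])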


Both of these scripts are located in the unisyn_spade/unisyn_scripts folder.

Importing

In order to import Raleigh, you need the Raleigh corpus on your machine (either locally or on an accessible mounted drive). Once you have this, you also need to clone PolyglotDB, install the requirements, and start the databases by following the instructions included with the software.

After starting up PolyglotDB, you can run the script import_corpus.py with the required arguments (run

python3 import_corpus.py --help

to see what these are). This will import Raleigh into PolyglotDB.
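As a quick sanity check after the import, you can open the corpus from Python. The sketch below assumes PolyglotDB is installed and its databases are running, and assumes the corpus was imported under the name “raleigh” (check import_corpus.py --help for the name it actually uses).

    from polyglotdb import CorpusContext

    # Count the word tokens as a rough check that the annotations made it in.
    with CorpusContext("raleigh") as c:
        print(c.query_graph(c.word).count(), "word tokens in the corpus")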

Enriching and Exporting

The intended use case for analyzing vowels is formant analysis, for which the Unisyn information will be useful. So before enriching with the Unisyn files, you probably want to encode formants; see Vowel Formants for instructions on how to do this. (However, this isn’t strictly necessary in order to run the steps below.)

Enriching and exporting the data is done by the script export_raleigh.py. Again, run it with the "--help" option to see the required arguments and how to use it. This script gives you the following columns for all primary stressed vowels of all words that are in the Unisyn lexicon:

  1. word_label
  2. phone_label
  3. phone_begin
  4. phone_end
  5. speaker
  6. file
  7. formants
  8. UnisynPrimStressedVowel1: this is the pre-rule keysymbol (or global keysymbol if no pre-rule keysymbol exists) for the primary stressed vowel
  9. UnisynPrimStressedSyl1
  10. UnisynPrimStressedVowel2_sca
  11. UnisynPrimStressedVowel3_sca
  12. UnisynPrimStressedVowel3_xsampa_sca
  13. anyruleapplied

After this, it has a column for every rule that could be applied for the South Carolina (SCA) accent.
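Once the export has run, the output is an ordinary CSV, so you can summarize it with standard tools. The sketch below assumes pandas, and uses “raleigh_export.csv” as a placeholder for the output file name (check export_raleigh.py --help for where the export is actually written).

    import pandas as pd

    tokens = pd.read_csv("raleigh_export.csv")

    # Number of primary stressed vowel tokens per post-rule Unisyn keysymbol
    # for the SCA accent.
    print(tokens.groupby("UnisynPrimStressedVowel3_sca")["phone_label"].count())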

Both of these scripts are located in the unisyn_spade/polyglot_scripts folder.


Note on multiple transcriptions

In the Unisyn lexicon, there are a number of words with multiple transcriptions (for example, “again” has two entries: one with a monophthong for the second vowel, one with a diphthong). However, structurally PolyglotDB only supports having one enrichment value per word, and currently it takes the last value. This means that if you use an enrichment file with multiple lines starting with the same word, the data that ends up getting encoded is the data in the last line starting with that word. Thus, if there is a particular pronunciation of a multiple-entry word that you would like to have encoded, you need to order it last in the file.
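If you want to check which pronunciation will win before enriching, one option is to pre-process the enrichment file yourself. The sketch below assumes pandas and an enrichment file with a Word column (“lexicon_enrichment.csv” is just a placeholder; substitute your actual file). Keeping only the last line per word mirrors what PolyglotDB will do, so the result shows exactly which entry gets encoded; reorder the file first if a different pronunciation should win.

    import pandas as pd

    enrich = pd.read_csv("lexicon_enrichment.csv")

    # Keep only the last line for each word, since that is the value PolyglotDB
    # ends up encoding when a word has multiple entries.
    deduped = enrich.drop_duplicates(subset="Word", keep="last")
    deduped.to_csv("lexicon_enrichment_deduped.csv", index=False)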