This page attempts to describe the internal workings of SAM in "somewhat" human-readable fashion.
It assumes you are familiar with the concepts of time and arithmetic!
It assumes you know (or can reference) SAM's phonemes.
It assumes you are familiar with programming concepts in general,
but no specific knowledge of the Commodore 128, 6502 ML, nor BASIC is required.
The reader should also be familiar with simple audio concepts like amplitude, frequency, phase, power, and wavelength.
Knowing some phonetic terms (like diphthong and plosive) would help,
but see the main SAM documentation if you need more info on these terms.
Familiarity with digital sampling and playback would also be helpful, but not strictly required.
(This page is mainly notes for myself, but you may find it useful!) Reciter is an extra layer on top of SAM, so SAM is analyzed first.
On a more abstract level, the description of my theory probably violates some "laws" of phonetics, sampling theory, and/or information theory. In other words, it is how I think about SAM's speech, and I hope you find it helpful for your own understanding, but it may have some subtle "technical" flaws. (I'll produce a perfect theory if you pay me!)

I strive to use accurate terms; however, SAM's BASIC wedge (on the original C64 version) and its documentation refer to some settings by inappropriate terms (calling wavelength PITCH and time delay SPEED, for example). So I generally don't use the terms of the original version, or if I do, I try to make it obvious (with all caps, as shown in the last sentence) and phrase them alongside the correct terms. So just realize that sometimes I may sound like an idiot who doesn't know the difference between wavelength and frequency... but I do understand, and I'm trying to dance around the misconceptions introduced by the original.

SAM first translates input text (phonemes), through multiple steps, into a series of phones and, finally, into a series of F_Samples (for lack of a better term). These are used in the final stage to actually render audio on the computer's sound chip. Each "F_Sample" consists of:

The base wavelength (or fundamental wavelength) is the inverse of the fundamental frequency. It is not directly rendered to the audio chip; however, it affects the phase of all 3 formants (see below) that are rendered. It can also delay processing of a type code. The fundamental frequency controls the general pitch of speech (in contrast to the formants, which provide higher-frequency details).

Note: I (mainly) use the term base wavelength, instead of frequency, because SAM does not manipulate the "fundamental" as a frequency, but as its reciprocal: time. I do not use a term like fundamental time period because this value actually does affect the pitch of speech (unlike some other SAM time periods which do not).
Hopefully the term wavelength reminds you it is related to frequency/pitch and not duration/speed.
Each of the 3 Formants ("mouth", "throat", and undocumented) consists of two pieces of data: a frequency and a power/amplitude. Note that power is used for preliminary smoothing while amplitude is used for final rendering; I will use the term "volume" when I need to refer to both. Hopefully you see that each F_Sample has 6 values (besides type and base wavelength): F1.frequency, F1.volume, F2.frequency, F2.volume, F3.frequency, and F3.volume.

You may be wondering, "What is a Formant?" If you have time, check out this Wikipedia article. The short answer is: a Formant is an "important" waveform (for human listeners) with a specific frequency and volume. In SAM, the first two formants have a sine waveform. The third formant has a square waveform. The important thing to know is that each formant (of a phoneme) is assigned a specific frequency and volume... and each phoneme / F_Sample has 3 formants ("mouth", "throat", and undocumented).

The type code tells which method(s) are used to render the F_Sample. SAM will render an F_Sample with either (or both) of the following methods:
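To make the layout concrete, here is a minimal sketch (in Python, with my own field names; SAM actually packs these values into parallel byte tables, not structures). The example values are purely illustrative, not copied from SAM's phoneme tables.

```python
# Minimal sketch of one F_Sample's contents (my own naming, not SAM's
# labels; SAM stores these in parallel byte tables, one byte per field).
from dataclasses import dataclass

@dataclass
class Formant:
    frequency: int  # table-driven byte value, not Hertz
    volume: int     # "power" during smoothing, "amplitude" at render time

@dataclass
class FSample:
    type_code: int        # selects formant and/or PCM rendering
    base_wavelength: int  # in phase steps (~150 microseconds each)
    mouth: Formant        # formant 1 (sine)
    throat: Formant       # formant 2 (sine)
    third: Formant        # formant 3, undocumented (square)

# illustrative values only (not taken from SAM's tables)
sample = FSample(0, 61, Formant(6, 10), Formant(73, 10), Formant(99, 0))
print(sample.base_wavelength)  # 61
```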
To clarify, I'll include some images which show the formants, base wavelength, and final output. For all of them, I used no stress on the phones and used SAM's default settings (in particular, PITCH=64 and SPEED=72). Although these images were captured from a Commodore 128 (in 40-column mode), SAM/Reciter does not generate graphs like these (it takes far too long to do in real-time). Let's start with an image showing the beginning of the NX phone: A few things you should note:
The base wavelength is calculated from three quantities. The first, most significant, is the PITCH value. Second, a delta-wavelength value is added to this, based on the stress of the phoneme (this is usually a negative value, which means a stressed phoneme will have a shorter wavelength, i.e., a higher frequency). Since the phoneme is unstressed, and PITCH is 64, this gives a preliminary base wavelength of 64. The final quantity involved is the frequency of formant 1 ("mouth"). This value is divided by two and subtracted from the preliminary base wavelength to get the final base wavelength. Because the NX phone has a mouth frequency of 6, half that value (3) is subtracted from 64 to give a final base wavelength of 61. Mathematically,

base wavelength = PITCH + delta_wavelength(stress) - ( F1.Frequency / 2 )

Note 1: the resulting value of wavelength is in units of about 150 microseconds (assuming the recommended TIMEBASE setting), which I simply call a phase step. The approximate frequency, in Hertz, is 6690 / wavelength. For the NX example, this means a fundamental frequency of 6690 / 61 = 110 Hz (approximately).

Note 2: the equation implies that PITCH plus the smallest used delta_wavelength should be greater than half the greatest "mouth" frequency (otherwise the wavelength will be less than or equal to zero and wrap around modulo 256... for example, if the equation resulted in a value of -2, the wavelength would be 254 phase steps; a result of 0 would be 256 phase steps). If you use the greatest possible stress, then the smallest delta_wavelength is -32. If you use the default KNOBS values, the greatest formant 1 frequency is 26. Thus PITCH - 32 > 26 / 2, or PITCH > 13 + 32, or PITCH > 45. If you used the maximum KNOBS "value-x" of 255, then the maximum mouth frequency would be 51 (almost 26*2). In that case (and assuming the most powerful stress) the relation becomes PITCH - 32 > 51 / 2, or PITCH > 25 + 32, or PITCH > 57.
Note 3: the equation makes no sense in terms of dimensional analysis! Adding PITCH (which is really a wavelength) to delta_wavelength gives us a wavelength (no problem), but then subtracting a frequency is just crazy! You can't really add or subtract wavelengths (err, time periods) with frequencies... it's like adding apples and oranges. For fun, ask a physics professor what you get when you subtract 3 Hz from 61 seconds. The only reason it works (I'm guessing) is because the frequency is subtracted from the wavelength. So a higher mouth frequency will subtract more from the base wavelength than a moderate frequency. Subtracting more will generate a shorter fundamental wavelength... which means a higher fundamental frequency. I'm guessing this is some easy approximation to a more complex function.

Next (second) is the beginning of the R phone. Because its "Mouth" frequency has a value of 18, the final base wavelength will be 64-18/2 = 55 phase steps. The base frequency will be about 6690 / 55 = 122 Hz (with the default PITCH and recommended TIMEBASE).

Next is the beginning of the (single) AW phoneme (it gets translated into two phones). Because its "Mouth" frequency has a value of 26 (at the beginning = first phone), the base wavelength will be 64-26/2 = 51 phase steps. The base frequency would be about 6690 / 51 = 131 Hz.
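The calculations for the three example phones can be sketched as follows (a Python model of the formula above, assuming unstressed phones so delta_wavelength is 0, and the default PITCH of 64):

```python
# Sketch of the base-wavelength formula and the 6690/wavelength frequency
# approximation, for the NX, R, and AW examples discussed above.
PITCH = 64

def base_wavelength(mouth_freq, delta_wavelength=0):
    # integer halving, since SAM works with 8-bit integer math
    return PITCH + delta_wavelength - mouth_freq // 2

def fundamental_hz(wavelength):
    # one phase step is ~150 microseconds with the recommended TIMEBASE
    return 6690 / wavelength

print(base_wavelength(6))         # NX phone -> 61 phase steps
print(base_wavelength(18))        # R phone  -> 55 phase steps
print(base_wavelength(26))        # AW phone -> 51 phase steps
print(round(fundamental_hz(61)))  # ~110 Hz
print(round(fundamental_hz(55)))  # ~122 Hz
print(round(fundamental_hz(51)))  # ~131 Hz
```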
Along the top of the previous images (above the fundamental wavelength), you should see little tick marks. These indicate where the next F_Sample will be queued (prepared for playback). Now things get a bit tricky! The next F_Sample won't fully begin until the (current) fundamental frequency completes its cycle. The (fixed) distance between the marks represents the 'requested' hold time for samples, as set by SPEED. With the default settings (SPEED slightly larger (longer) than PITCH (wavelength)), each F_Sample will alternate between 1 and 2 base wavelengths per sample.

When a new F_Sample is queued, as indicated by the tick marks, SAM will immediately change all 3 formants to the new values of the next F_Sample. However, the base wavelength and "type" of the current F_Sample will remain in effect until the end of the current base wavelength. In each of the images above, all the F_Samples were the same (which makes it impossible to see when a new F_Sample begins).

When SAM first starts building the F_Samples from your input text, the same F_Sample is repeated for the duration of the phone (each phoneme/phone has one or two possible durations: natural and stressed). After all phones in a phrase have been "exploded" into a series of repeated F_Samples, SAM will modify some (or all) of the F_Samples around the transition between phones. For example, instead of having an abrupt change of frequencies, volumes, and base wavelength between two phones, the components (frequency, volume, base wavelength) will smoothly change over a period of several F_Samples.

The following image shows the transition between the "M" and "IY" phones of the phonetic phrase "MIY". The above image shows the F_Samples about 100 milliseconds after the start of the "MIY" phrase -- near the transition between the "M" and "IY" phones. Looking at Formant 1, it should be obvious where the phase reset occurs at each new base wavelength.
Although it is a bit difficult, you can also see where a new frequency (in formant 1) occurs about 25% into the third FULL base wavelength (at the point of the third tick-mark). To make things easier for you, look at the amplitude of Formants 2 and 3... at the same point in time, you should see an obvious increase in amplitude. You can also see another increase in amplitude of those formants about 50% into the fourth FULL base wavelength (at the point of the fourth tick-mark). The important things to note are:
The 3 formants, on the other hand, are smoothed with a biased algorithm. For each pair of adjacent phones, this routine will compare a "dominance" value assigned to each phone. The more-dominant phone will have the same number or (usually) fewer of its F_Samples altered than its less-dominant (subordinate) neighbor. For example, with a dominant phone before a subordinate one, the smoothing may begin during the last 3 samples of the first phone and extend into the first 5 samples of the next phone. In particular, the last 3 samples of the first phone may be quite far from the center (for example, if the phone was 12 F_Samples in duration, its center would be 6 F_Samples from the start of the next phone). Thus the more dominant phone has fewer of its samples smoothed, and the weaker phone has more samples altered.

Anyway, once the biased start and end positions are determined, SAM once again generates a linear sequence of values between the start and end values (this is done 6 times: frequency and power of all three formants). During the smoothing process, each formant in the F_Samples is composed of a frequency and a power. After smoothing, the power in the F_Samples is converted to amplitude using a simple look-up table.
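The biased smoothing idea can be sketched like this (a hedged Python model: the window sizes and the values around the boundary are made up for illustration, and SAM's actual dominance tables and interpolation routine differ in detail):

```python
# Hedged sketch of dominance-biased smoothing: the blend window extends
# further into the subordinate phone, and each of the six components
# (frequency and power of 3 formants) is interpolated linearly between
# the window's endpoints. Window placement here is illustrative only.
def blend(values, start, end):
    """Linearly interpolate values between indices start and end."""
    span = end - start
    v0, v1 = values[start], values[end]
    for i in range(1, span):
        values[start + i] = v0 + (v1 - v0) * i // span  # 8-bit-friendly math

# F1.frequency around a phone boundary at index 4 (phone A = first 4
# samples, phone B = the rest); the window is biased into subordinate B.
f1_freq = [18, 18, 18, 18, 26, 26, 26, 26, 26, 26]
blend(f1_freq, 1, 9)
print(f1_freq)  # [18, 18, 19, 20, 21, 22, 23, 24, 25, 26]
```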
An important thing to note is that any PCM rendering (the currently discussed blended method, or pure PCM) occurs at (approximately) 3x the "normal" rate used in the "pure" Formant rendering method (discussed above) -- assuming default TIMEBASE and NOISES settings. In order to adequately show this, I had to "expand" my images on the horizontal axis by a factor of three. Sadly, this made images of SAM's waveforms too "long" for the 40-column video screen, so I switched to 80-column mode for the following images...

To get you started, first is an image of a "normal" (pure Formant) rendering of the phone (M): Hopefully you can see the similarities with the very first image above (NX phone). In particular, the base wavelength is 61 phase steps, the F_Sample period is 72 phase steps, you can see "dimples" (a phase reset) in Formant 1's sine wave at each new base wavelength, and all the F_Samples are the same (no change in frequency or amplitude). The main differences to note are:
Next is an image showing actual dual rendering. It is the beginning of the V phone. There are several differences with the "dual" method to note (most of them you can see, for the rest, you'll have to take my word):
The purpose of the dual rendering method (my theory, of course) is to generate frequencies higher than are possible with the standard 3-formant method. From experience, I must say it does an effective job; however, it should be obvious from the image that it fouls up two of the main components of audio: pitch and speed. I discuss each next.

The base wavelength represents the general pitch of a phone (or more precisely, an F_Sample within a phone). For the V phone in the above image, the base wavelength should have a value of 60. You can either peek into SAM's memory of F_Samples to verify this, or calculate it (with the formula listed above) as 64-8/2 = 64-4 = 60 (the magic number 64 is the PITCH setting, and the magic number 8 is the "mouth" frequency of the "V" phone). However, as shown in the image, the actual wavelength is 7.2% less (55.67). This means a frequency/pitch higher by about 7.8%. The formula to calculate a "dual" base wavelength (in units of phase-steps) is approximately:

dual_base_wavelength = int(standard_base_wavelength * 3/4) + 8 * [int(standard_base_wavelength / 16) + 1] / 3

The first addend in the formula represents 75% of the standard (3-formant method) base wavelength, rounded down (the "int" function). The second addend of the formula represents the time period for n PCM bytes (8n PCM samples). The number of PCM bytes is:

dual_PCM_bytes = int(standard_base_wavelength / 16) + 1

That is, 1/16 of the standard base wavelength (rounded down) plus one.

The F_Sample period controls the rate of speech (I think of it as the F_Sample hold period). Although the base wavelength has an imperfect compensation factor of 3/4, the F_Sample period has no compensation! Because there is no compensation, and SAM "forgets" to update its F_Sample counter during the burst of a dual-rendered phone, an F_Sample rendered with the dual method will be longer than requested by the SPEED setting. How much longer is almost random!
As discussed previously (standard rendering), each F_Sample will generally consist of one or two base wavelengths (with default settings, SAM alternates between one and two such that the average equals the requested SPEED value). For what it is worth, the approximate formula for a "dual" F_Sample's time period is:

(dual) F_Sample period = SPEED + n * 8 * dual_PCM_bytes / 3

where n is generally 1 or 2 (the number of dual_base_wavelengths which occur within an F_Sample). The above image of the V phone shows this very well. The number of PCM bytes is 4 because int( 60/16 ) + 1 = 3 + 1. The first F_Sample, which contains one base wavelength, has a period of 72 + 1 * 8 * 4 / 3 = 82.67 phase-steps. The second F_Sample, which contains two base wavelengths, has a period of 72 + 2 * 8 * 4 / 3 = 93.33 phase-steps.

In summary, each F_Sample having a type of "dual rendering" will insert one or more bursts of (high-frequency) PCM samples. Each burst will occur after the first 75% of the phone is rendered with the standard (3-formant) method. The burst is rendered from one of five (256-byte) PCM tables (see NOISES "Y-values"), and will consist of a variable number of bytes (based on the standard base wavelength). The effective base wavelength will be shorter than standard (thus higher pitch), while the F_Sample's duration will be longer than standard (thus slower speech). See, I told you the dual method is tricky!

One final technical note: the first PCM byte of the first "dual rendered" phone in a phrase will be a "random" value in the phone-specific PCM table. However, each subsequent PCM byte in the phone (or any following "dual rendered" phones) will be the next byte in the table. In effect, each burst will step through a PCM table just like the standard method steps through the phases of a sine wave. (However, the initial "pseudo-phase" is random.) A second technical note: each PCM byte encodes 8 PCM samples... that is, 1-bit PCM.
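The dual-rendering formulas above can be checked against the V phone example with a short sketch (assuming PITCH=64, SPEED=72, and the V phone's "mouth" frequency of 8, which give a standard base wavelength of 60):

```python
# Sketch of the dual-rendering formulas, checked against the V phone
# example above (standard base wavelength = 64 - 8/2 = 60 phase steps).
def dual_pcm_bytes(std_wavelength):
    return std_wavelength // 16 + 1

def dual_base_wavelength(std_wavelength):
    # 75% of the standard wavelength (rounded down) plus the burst time
    return int(std_wavelength * 3 / 4) + 8 * dual_pcm_bytes(std_wavelength) / 3

def dual_fsample_period(speed, std_wavelength, n):
    # n = number of dual base wavelengths within the F_Sample (usually 1 or 2)
    return speed + n * 8 * dual_pcm_bytes(std_wavelength) / 3

std = 64 - 8 // 2                                  # 60 phase steps
print(dual_pcm_bytes(std))                         # 4 PCM bytes (32 samples)
print(round(dual_base_wavelength(std), 2))         # 55.67 phase steps
print(round(dual_fsample_period(72, std, 1), 2))   # 82.67 phase steps
print(round(dual_fsample_period(72, std, 2), 2))   # 93.33 phase steps
```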
The amplitude of each bit (0 or 1) is fixed for the dual rendering method (SID volume 6 or 10, respectively). More importantly, each bit is rendered for about 55 microseconds (using 1MHz/NTSC with default NOISES values)... this is only approximately 1/3 of a phase-step. So the images and equations for dual rendering are approximate. More precision would require an even more expanded image and more complex formulas. If for some reason you want a more precise equation, then substitute the value "3" in the equations with the value "149.5 / 55" (about 2.718).
Because it relies on a table of PCM bytes, PCM rendering generates a fixed set of frequencies (it ignores PITCH / base wavelength). Unlike the dual and standard methods of rendering, it also generates audio for a fixed time period (it ignores SPEED / F_Sample period). I like to think of this rendering method as "pure noise"; however, each phoneme has a specific PCM table to use, a specific initial phase, and a specific duration... so really it is pseudo-noise. If you examine the data in the 5 PCM tables, you will notice a different mixture of frequencies in each.

Below is an image showing the transition between the pure PCM phone "CH" and the pure formant phone "IY". I didn't show all the PCM samples (because they would fill more than a full screen). As you can see (on the left third of the graph), the base wavelength and 3 formants are ignored (not updated or rendered) while the pure PCM samples are playing. The image shows a tick mark along the top for every PCM byte (every 8 PCM samples). This represents a reset of the F_Sample period... the image is a bit misleading because the F_Sample period is not actually reset on every PCM byte -- just after the last PCM byte. But the ultimate effect is the same: the F_Sample period and base wavelength (re)start as soon as the PCM samples finish playing.

If you want to calculate the duration of a "pure PCM" F_Sample, take the duration value (number of PCM bytes) from the NOISES "X-Values" then multiply by 8. This is the number of PCM samples (or H-Steps, as I like to refer to each time period). The duration is then approximately 50 microseconds times the number of PCM samples. Approximately 50 because that value is 1/3 of a phase-step (as shown in my wide graphs). The actual duration (using default NOISES on an NTSC machine at 1MHz) is closer to 58 microseconds times the number of PCM samples. For the CH example, per the NOISES listing, the number of PCM bytes is 112. So 1*112*8 = 896 PCM samples.
The duration is approximately 50x that value (in microseconds), or 44.8 milliseconds. A more accurate calculation is 896 * 58 = 51968 microseconds, or about 52 milliseconds. Either way, you can see the duration is much longer than a typical ("normally rendered") F_Sample, which has a duration around 72*150 = 10800 microseconds, or about 11 milliseconds.

Don't forget that the actual phone consists of multiple F_Samples. To calculate the total duration of a phone, you normally multiply the F_Sample duration by the number of F_Samples in a phone (the duration column in the phoneme table). However, for "pure PCM" rendering, you have to divide the result by two. This is because whenever a "noisy" phone (pure PCM) is rendered, SAM will skip the next F_Sample in its sample buffer! All phones with a type of "pure PCM" are listed in the phoneme table with a duration of two. Some duration modification rules could extend the value, but (as far as I can tell) the result will always be a multiple of two. So SAM is (or rather should be) cutting off half the duration listed in its own tables for the "noisy" phone.

Although the skipped F_Sample will be missing from the final output, it would be wrong to think it has no effect. The "missing" F_Sample will be used in the smoothing of F_Samples discussed earlier. By skipping it during the rendering stage, the result is a more abrupt transition. This seems silly to me, since the transition between PCM samples and 3-formant samples is already very abrupt! The similarities of "pure" PCM Rendering with "burst" PCM (Dual Rendering) are:
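The duration arithmetic for the CH example can be sketched as follows (using the measured ~58 microseconds per PCM sample and ~150 microseconds per phase step quoted above, both for default settings on an NTSC machine at 1MHz):

```python
# Sketch of the pure-PCM duration math, using the CH phone example
# (112 PCM bytes per the NOISES "X-Values").
US_PER_PCM_SAMPLE = 58    # measured: ~58 us per PCM sample (NTSC, 1 MHz)
US_PER_PHASE_STEP = 150   # ~150 us per phase step, recommended TIMEBASE

pcm_bytes = 112
pcm_samples = pcm_bytes * 8            # each PCM byte holds 8 1-bit samples
print(pcm_samples)                     # 896 PCM samples
print(pcm_samples * US_PER_PCM_SAMPLE) # 51968 us, about 52 ms

# compare: a typical formant-rendered F_Sample (SPEED=72 phase steps)
print(72 * US_PER_PHASE_STEP)          # 10800 us, about 11 ms
```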
The differences between the "pure" and "burst" methods include:
As discussed under Formant Rendering, the next F_Sample in a phrase may (and usually does) begin anywhere within a base wavelength. The frequency and amplitude of the three formants will take effect immediately; however, the new F_Sample's type and base wavelength won't take effect until the start of the next base wavelength (or if you prefer, until the end of the current base wavelength). I don't know if this is a bug, or by design. Because the base wavelength doesn't immediately change, this helps SAM speak with a (slightly) more consistent pitch. Because the type doesn't change immediately, this can result in "confused rendering": an F_Sample may play with the new/correct frequency and amplitude but with the wrong type. If this is not a bug, then I would say it is late-stage smoothing (as opposed to the normal, pre-render smoothing).

Below is an image of the transition between the "DH" and "IY" phones of the phrase "DHIY" to illustrate. The image shows two complete F_Samples and three complete base wavelengths (plus fractional parts near the left and right borders). The first complete F_Sample is from the end of the dual-rendered ("burst" PCM) phone "DH". The second complete F_Sample is from the start of the 3-formant phone "IY". In the first (complete) base wavelength (part of phone "DH"), you can see the base wavelength has a value of 56 phase-steps, a type of "burst PCM", and zero amplitude for Formant 3. In the third (complete) base wavelength (part of phone "IY"), you can see the base wavelength has a value of 60 phase-steps, a type of "3-Formant" (no PCM samples), and a non-zero amplitude for Formant 3.

Importantly, the central base wavelength shows the "confused rendering" discussed above. Near the middle of the second (complete) base wavelength, the new F_Sample (for phone "IY") begins. All three formants are updated immediately...
it is not easy to see with Formants 1 and 2, but it should be obvious that Formant 3's amplitude increases from zero to some positive value. Although the new F_Sample updates the frequency and amplitude of the 3 formants instantly, the base wavelength remains at the old F_Sample value of 56 phase steps, and the type remains at the old F_Sample value of "burst PCM". Thus, you see a burst of 4 PCM bytes (32 PCM samples) at the end of the second (complete) base wavelength... even though SAM has begun the non-PCM (3-formant) rendering of the next F_Sample.

I'll explain why this happens (if you haven't figured it out), because it is not obvious from the image. The F_Sample type is "burst PCM" when the second (complete) base wavelength begins. Thus SAM will render a burst of PCM samples after 75% of the base wavelength... despite the fact that a new F_Sample (of non-PCM type) begins near the middle (50%) of the base wavelength. In other words, the "type" of a (new) F_Sample is delayed (ignored) until the start of the next base wavelength.

This is a minor technical detail that, normally, I might not bother to mention and for which I definitely wouldn't construct a graphic image. However (as discussed under PCM Rendering), SAM should be cutting the last half of a "noisy" (pure PCM) phone. But because an F_Sample's type is not recognized (not "seen") by SAM until a new base wavelength begins, and because two F_Samples may "start" within a single base wavelength (only if you ignore the rule SPEED > PITCH), SAM may not see the first of a two-duration (two F_Samples) "noisy" phone. In this case, when the next base wavelength begins, SAM will render the second F_Sample as "noise" (pure PCM) and then skip the following (presumably non-noise) F_Sample. This (normally) just has the effect of cutting out a short transition (smoothing) F_Sample... or at least that is what I think the original authors intended! Unfortunately, there is a bug in the way this is coded in SAM...
because the skipped F_Sample could be anything (and not part of the "noisy" phone as intended), SAM may skip the end-of-phrase! If SAM skips the end-of-phrase, then it will play all remaining "junk" in its F_Sample buffer(s), then wrap around to the beginning and play it again. At the very least, this will sound terrible. At worst, it may result in an infinite loop! It would only result in an infinite loop if every pass of the F_Sample buffer resulted in two "noisy" F_Samples occurring in a single base wavelength right before the end-of-phrase. On the one hand, this seems unlikely because (generally) the base wavelength and F_Sample period are separate variables, and tend to drift relative to each other. On the other hand, as the image of the pure PCM rendering shows, the end of any "noisy" phone (pure PCM) will force an alignment of base wavelength and F_Sample period. So I believe (but have not tested) that if any "noisy" phones appear prior to the final, buggy, "noisy" phone, then an infinite loop will happen.
I go into detail below, but here are the main things SAM does:
Now that the simple(!) work has been done, we can prepare to render audio. Basically we need to build (8) tables of F_Sample data. Each F_Sample consists of:
Briefly, the following actions occur:
Now all the F_Samples of the phrase are ready to be played! Note the value in "next F_Sample index" is the number of samples to play.
After the final loop ends (assuming it does), disable I/O registers and return to the caller.

Notice the bug in the algorithm: for F_Samples of render type "Pure PCM", the number of samples remaining will be reduced by one but not tested (it could now be zero) -- this effectively skips the next F_Sample. When step 3 occurs, the remaining samples will be reduced again (if it was zero, it will now be 255). And when step 4 occurs, the test for end-of-F_Samples will fail (if the number of samples remaining wrapped to 255). I think this bug got into the code because (as far as I can tell) the number of F_Samples per sub-phone of type "Pure PCM" is always an even number. And with the recommended setting of SPEED > PITCH, the "skipped" F_Sample will always be the second F_Sample of a "Pure PCM" pair. With those assumptions, the end-of-F_Samples would never be skipped. However, if the duration modification rules somehow created an odd number of F_Samples for "Pure PCM" rendering (it doesn't seem possible, but I could have missed something), or if PITCH > SPEED (which the user can easily do), then the skipped F_Sample may not be part of the "Pure PCM" sub-phone. If this happens on the very last F_Sample of a phrase, SAM will play all remaining "junk" in the F_Sample buffer, wrap around, and play the F_Sample buffer again (a potential [very likely] infinite loop)!

One way to fix the bug is to enforce the rule SPEED > PITCH (and also verify the duration rules could not generate an odd number of F_Samples of type PCM rendering). A lot of trouble. One easy fix is to simply test for zero after the first decrement (before returning to the outer loop for the second decrement and test). Another easy method is to never decrement the number of samples remaining; instead compare the current F_Sample index with the total number and exit if greater or equal. Not only does this solve the problem, it reduces the number of code bytes and increases render speed.
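To make the skip bug and the "compare index" fix concrete, here is a hedged Python model of the counter logic (not SAM's 6502 code; the buffer contents, function names, and the demo guard are all made up for illustration):

```python
# Python model of the skip bug: the "samples remaining" counter is an
# 8-bit value, so decrementing it past zero wraps to 255 and the
# end-of-F_Samples test is missed.
def render_buggy(samples):
    played = []
    remaining = len(samples)  # SAM keeps this in an 8-bit counter
    i = 0
    while True:
        played.append(samples[i % len(samples)])  # buffer index wraps too
        if samples[i % len(samples)] == "pure_pcm":
            i += 1                             # skip the next F_Sample...
            remaining = (remaining - 1) % 256  # ...decremented, NOT tested
        i += 1
        remaining = (remaining - 1) % 256
        if remaining == 0:   # fails if the counter already wrapped to 255
            return played
        if len(played) > 6:  # demo-only guard against the runaway loop
            return played + ["...junk / wrap-around..."]

def render_fixed(samples):
    played = []
    i = 0
    while i < len(samples):  # compare index vs. total: no wrap possible
        played.append(samples[i])
        i += 2 if samples[i] == "pure_pcm" else 1
    return played

buf = ["formant", "pure_pcm"]  # "noisy" F_Sample lands at the buffer's end
print(render_buggy(buf))       # runs past the end and replays the buffer
print(render_fixed(buf))       # ['formant', 'pure_pcm']
```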
The only minor problem with this method is that the value for TIMEBASE would need to be increased due to the faster CPU execution (which is probably a good thing, because at 1MHz speed, the original TIMEBASE can't be decreased at all for PAL machines and only by 1 for NTSC machines... more variety would be nice).

SAM (64) © Don't Ask, Inc., 1982
SAM 128 © H2Obsession, 2015
Webpage © H2Obsession, 2015, 2016