11 Nov, 2009, Barm wrote in the 1st comment:
Votes: 0
A few years ago I was playing around with a random name generator. My approach was to cobble together random letter combinations like:
leading consonants + vowels + inner consonants + vowels + closing consonants

Basically, I was aiming for something pronounceable, with common letters weighted to appear more often. It produced output such as:

Votharn Eristacark Iplortidot Birtoil Udaeteahieb Aceastoherk Reloist Tharnog Wasterk Femewelav Ublyrrielic Cekird Owritothol Hoogoh Obloukajarriem Sleebont Niestart Pekev Lirtooth Efentoidagix Klyckas Yryfesat Klooton

Yeah … that's really, really awful.
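
For illustration, the letter-combination approach described above might look roughly like the sketch below. The letter pools and their weighting (repeated entries make a letter more likely) are invented for this example and are not the ones from the original generator, so the output will not match the sample above.

```python
import random

# Hypothetical letter pools. An entry listed more than once is
# proportionally more likely to be picked.
LEADING = ['b', 'c', 'd', 'f', 'k', 'l', 'r', 's', 't', 't', 'th', 'kl', 'sl']
VOWELS = ['a', 'a', 'e', 'e', 'e', 'i', 'o', 'o', 'u', 'oo', 'ie', 'ea']
INNER = ['r', 'rt', 'st', 'l', 't', 'n', 'm', 'd']
CLOSING = ['n', 'rk', 'st', 't', 'g', 'l', 'x', '', '']


def random_name(rng=random):
    """leading consonants + vowels + inner consonants + vowels + closing consonants"""
    parts = [
        rng.choice(LEADING),
        rng.choice(VOWELS),
        rng.choice(INNER),
        rng.choice(VOWELS),
        rng.choice(CLOSING),
    ]
    return ''.join(parts).title()


if __name__ == '__main__':
    print(' '.join(random_name() for _ in range(10)))
```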

So I decided to give it another whack. This time I started with the premise, 'what sounds most like a name?'

Names do!

I found a couple of files with over a thousand of the most common male and female first names on the US Census Bureau's web page and started playing. I wrote a Python script that used regular expressions to slice a batch of words into three lists:

List 1 = Zero or more vowels + One or more consonants at the start of the word
List 2 = One or more vowels + One or more consonants inside the word (not at the start or the end). We can get 0 or more of these patterns depending on the word.
List 3 = One or more vowels + Zero or one consonant at the end of the word.
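
As a small worked example of the three lists, the sketch below slices two sample names. The character classes and the `slice_name` helper are assumptions made for this illustration ('y' is treated as a vowel, and the tail allows at most one closing consonant).

```python
import re

VOWELS = 'aeiouy'
CONSONANTS = 'bcdfghjklmnpqrstvwxz'

# List 1: optional vowels, then consonants, anchored at the start
LEAD = re.compile(r'^[%s]*[%s]+' % (VOWELS, CONSONANTS))
# List 2: vowels then consonants, strictly inside the word
INNER = re.compile(r'\B[%s]+[%s]+\B' % (VOWELS, CONSONANTS))
# List 3: vowels plus an optional closing consonant, anchored at the end
TAIL = re.compile(r'[%s]+[%s]?$' % (VOWELS, CONSONANTS))


def slice_name(name):
    """Return (lead, [inner chunks], tail); a part is None if nothing matches."""
    name = name.strip().lower()
    lead = LEAD.match(name)
    tail = TAIL.search(name)
    return (lead.group(0) if lead else None,
            INNER.findall(name),
            tail.group(0) if tail else None)


print(slice_name('Christopher'))  # ('chr', ['ist', 'oph'], 'er')
print(slice_name('Roberta'))      # ('r', ['ob', 'ert'], 'a')
```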

Side note: If you haven't dug into regular expressions yet, I highly recommend you check them out. I avoided them for years and now they're an essential part of my programmer toolbox. Another big plus is that their utility spans multiple languages.

I also tracked the frequency of each pattern, sorting by most common first and discarding the rares. Finally, I dumped the output formatted as Python lists that I could paste right into the source of the next script.

#!/usr/bin/env python

#——————————————————————————
# analyze.py
#——————————————————————————

import re
import random
import operator


_FILENAME = 'data/female2.txt'

## Match 0 or more vowels + 1 or more consonants at the start of the word
_LEAD = re.compile(r'^[aeiouy]*[bcdfghjklmnpqurstvwxz]+')
## Match 1 or more vowels + 1 or more consonants inside a word (not start/end)
_INNER = re.compile(r'\B[aeiouy]+[bcdfghjklmnpqurstvwxz]+\B')
## Match 1 or more vowels + 0 or 1 consonant at the end of a word
_TRAIL = re.compile(r'[aeiouy]+[bcdfghjklmnpqurstvwxzy]?$')


def token_lists(names):

    lead, inner, tail = {}, {}, {}

    ## Populate dictionaries; key=pattern, value=frequency
    for name in names:

        match = re.match(_LEAD, name)
        if match:
            pat = match.group(0)
            count = lead.get(pat,0)
            lead[pat] = count +1

        matches = re.findall(_INNER, name)
        for pat in matches:
            count = inner.get(pat,0)
            inner[pat] = count +1

        match = re.search(_TRAIL, name)
        if match:
            pat = match.group(0)
            count = tail.get(pat,0)
            tail[pat] = count +1

    ## Convert dicts to a list of tuples in the format (pattern, frequency)
    lead_srt = sorted(lead.items(),key=operator.itemgetter(1),reverse=True)
    inner_srt = sorted(inner.items(),key=operator.itemgetter(1),reverse=True)
    tail_srt = sorted(tail.items(),key=operator.itemgetter(1),reverse=True)

    ## Build lists of patterns ordered most to least frequent and cull rares
    lead_list = [ x[0] for x in lead_srt if x[1] > 4 ]
    inner_list = [ x[0] for x in inner_srt if x[1] > 4 ]
    tail_list = [ x[0] for x in tail_srt if x[1] > 4 ]

    return lead_list, inner_list, tail_list


if __name__ == '__main__':

    ## The patterns are lowercase, so normalize each line before matching
    with open(_FILENAME, 'rt') as f:
        names = [line.strip().lower() for line in f if line.strip()]
    lead_list, inner_list, tail_list = token_lists(names)

    print('#', len(lead_list), len(inner_list), len(tail_list))
    print('_LEADS = ', lead_list)
    print('_INNERS = ', inner_list)
    print('_TAILS = ', tail_list)



Next I used a script to assemble random names from these lists. Here's the one for male names:

#!/usr/bin/env python

import random

_LEADS = ['m', 'd', 'j', 'r', 'l', 'w', 'c', 'h', 'g', 'b', 't', 'br', 'k',
's', 'n', 'cl', 'fr', 'f', 'p', 'st', 'v', 'sh', 'ch', 'gr', 'tr', 'ant']

_INNERS = ['er', 'ar', 'el', 'or', 'ic', 'arr', 'an', 'am', 'on', 'ol', 'al',
'en', 'ill', 'in', 'err', 'and', 'il', 'om', 'arl', 'et', 'ev', 'ac',
'av', 'ert', 'enn', 'ent', 'ath', 'onn', 'ust', 'enc', 'ist', 'anc', 'it',
'ich', 'os', 'est', 'ald', 'erm', 'ern', 'yr', 'em', 'ew', 'ob', 'uc',
'as']

_TAILS = ['o', 'on', 'e', 'y', 'ey', 'er', 'in', 'an', 'ie', 'io', 'en',
'is', 'el', 'us', 'es', 'as', 'ian', 'ed', 'or', 'uel']


def namegen():
    ## 0 or 1 inner chunks, with a 15% chance of one more
    syllables = random.randint(0,1)
    if random.random() > .85:
        syllables += 1
    name = random.choice(_LEADS)
    for x in range(syllables):
        name += random.choice(_INNERS)
    name += random.choice(_TAILS)
    return name.title()


if __name__ == '__main__':

    for x in range(100):
        name = namegen()
        print(name, end=' ')
    print()


Which gives output like:

Shen Grolon Mor Warred Hilluel Huel Pie Reror Jo Frewed Pin Rasy Steris Tron Chelo Nor Stomey Stennuel Sernas Golian Karrernin Varren Kichey Hemio Grerto Bas Wo Wathas Fandes Ferrey Kan Fines Lanin Clancuel Vy Pacosian Ke Shie Rathis Dor Ce Frencermis Van Nances Fones Reny Kas Shathus Brer Jilled Vio Feton Testin Hian Bed Rio Trewavan Nernel Antamalor Cance Stomio Venno Lel Grus Antel Jey Holis Kas Stie Lewo Tille Vin Hey Jon Trencor Bancis Fie Stel Fruel Brestin Javen Cancen Sel Trarrus Brarlitio Kertes Cie Perned Fris Storas Stey Grarly Dor Grio Won Gentian Stan Clichy Das Tennan

Seeding with female names, I get:

Lindi Elon Pacia Tulee Gia Koneen Shabee Stan Siannis Vory Stannin Kesses Stinie Re Genistian Krolonah Ladee Perie Detter Clel Chillistyn Angami Han Can Pesten Shulis Alie Worianon Choryn Neana Clindetis Angannian Pishia Man Chres Bralan Cilla Pissie Free Veen Vinon Paly Angancy Chran Jabi Ali Shes Cerran Sissan Ner Katia Sten Dah Mandrie Den Tilis Nancer Chey Cenny Mamah Angisheen Wian Deen Jennameen Kelie Stacian Chrerrian Tren Doson Loler Kria Ci Claces Kandree Try Tande Chianan Lishia Sadis Belon Clady Closonel Rindah Krestah Elian Son Triny Jin Chis Hianisie Angilli Tannee His Silli Vy Kiannie Chraria Lulon Wy Brissyn

I've been experimenting with different sample sizes and varying amounts of culling infrequent patterns. Plus, it's hard to gauge success. You wouldn't want to name your children from those lists, but if I had to populate a fantasy town full of NPCs I'd be content with many of those. Two things I liked were the simplicity of the finished code and that the gender sound mostly survived the mulching process – except Stan the transvestite.
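
To make the culling experiments concrete, here is a minimal sketch of a threshold-based cull. The `cull` helper and the pattern counts are made up for illustration; the `> 4` test in analyze.py corresponds to `min_count=5` here.

```python
from collections import Counter


def cull(counts, min_count):
    """Return patterns seen at least min_count times, most common first."""
    return [pat for pat, n in counts.most_common() if n >= min_count]


# Made-up frequencies for a handful of tail patterns
tails = Counter({'a': 40, 'ie': 12, 'yn': 5, 'uel': 2, 'ix': 1})

for threshold in (1, 5, 10):
    print(threshold, cull(tails, threshold))
```

A lower threshold keeps rare patterns and gives more variety; a higher one keeps the output closer to the most typical sounds of the source names.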
11 Nov, 2009, Idealiad wrote in the 2nd comment:
Votes: 0
I can't remember if it was Bartle who did a name generator using a huge index of Mongolian names…I think it gave convincing Dwarvish names.;D

Awesome post Barm, thanks for the code.
11 Nov, 2009, Runter wrote in the 3rd comment:
Votes: 0
I've seen some pretty good random name generators that even let you generate a name based off of different fantasy races, but the neatest feature I thought it had was it let you check a name to see if it was valid or not also based on if it sounds dwarven, human, etc. Probably a little strict in actual application but I thought it was pretty cool. (Even though I'd never use it probably.)
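
A check of that sort could, as one hypothetical approach, reuse the chunk lists from the generator earlier in the thread: a name "fits" if it can be read as a lead, up to two inner chunks, and a tail. The `fits_pattern` function and the tiny lists below are assumptions for illustration, not code from any of the generators mentioned.

```python
def fits_pattern(name, leads, inners, tails, max_inners=2):
    """Return True if name == lead + (0..max_inners inner chunks) + tail."""
    leads, inners, tails = set(leads), set(inners), set(tails)

    def rest_fits(rest, inners_left):
        # The remainder is either a tail, or an inner chunk followed by more
        if rest in tails:
            return True
        if inners_left == 0:
            return False
        return any(rest.startswith(chunk)
                   and rest_fits(rest[len(chunk):], inners_left - 1)
                   for chunk in inners)

    name = name.lower()
    return any(name.startswith(chunk)
               and rest_fits(name[len(chunk):], max_inners)
               for chunk in leads)


# Tiny made-up lists; per-race lists could be swapped in the same way
LEADS = ['j', 'st', 'br']
INNERS = ['ev', 'and']
TAILS = ['on', 'en', 'o']

print(fits_pattern('Jon', LEADS, INNERS, TAILS))     # True: j + on
print(fits_pattern('Steven', LEADS, INNERS, TAILS))  # True: st + ev + en
print(fits_pattern('Xavier', LEADS, INNERS, TAILS))  # False: no lead matches
```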
11 Nov, 2009, Tricky wrote in the 4th comment:
Votes: 0
I did a similar thing in LPC last year.

LPC name generator

Since the original was written in 'C' by Ian Bell it can easily be converted into a snippet for Diku based muds.

Tricky
11 Nov, 2009, Hades_Kane wrote in the 5th comment:
Votes: 0
This is my favorite web based name generator:
http://www.rinkworks.com/namegen/

I've been keeping an eye out for a MUD compatible one though.

I don't really know anything about Python, however, but still, it seemed like a very helpful piece of code :)
11 Nov, 2009, Barm wrote in the 6th comment:
Votes: 0
I was going to mention having made demon and cthulhu-esque name generators but the post was getting kinda long already. I thought medieval weapons would make good dwarven names but it didn't work so hot.

The demon names look like this:

Amusamion Lucathin Negobias Mamu Andrehotuson Shilastetam Memusadu Kelis Falphax Zevur Andravius Mukabo Grotobu Bobabu Hegan Lusem Andrady Ziarocub Amagam Orucos Genas Bucy Fovo Canukim Shamecou Zathobam Farepis Lucegathim Gantehin Agiarobas Reries Amasis Andromantes Aborite Astekasies Rosur Chatartado Notelor Namo Matham Euronaso Talphel Conavel Dosatem Tasty Andrerias Apeeravi Perou Conalphal Civinius Bevasas Lonynor Andragis Dusub Grakius Cilago Harevion Kynolim Lusartanou Bonia Mantalpham Criarepe Grenalphos Halphion Chekigis Ducah Lehelah Miuolel Maphusin Hitim Loce Belahia Byalenor Astasados Peeru Kantantan Mevathia Andriarasit Vyalam Pamy Polivekal Shorgonit Mitynilah Tavamit Sathah Lucagovan Sageeravies Apurartes Lucilal Euravomem Gryala Hukomion Lathagis Amyalemis Riaris Aliueri Geerub Lucepes Apeeran Fitigaphos
11 Nov, 2009, KaVir wrote in the 7th comment:
Votes: 0
Runter said:
I've seen some pretty good random name generators that even let you generate a name based off of different fantasy races, but the neatest feature I thought it had was it let you check a name to see if it was valid or not also based on if it sounds dwarven, human, etc. Probably a little strict in actual application but I thought it was pretty cool. (Even though I'd never use it probably.)

Reminds me of my first snippet, which checked whether names could be pronounced, whether they contained inappropriate words, and whether they matched race-specific requirements. I never used it either though.
11 Nov, 2009, Skol wrote in the 8th comment:
Votes: 0
There was one that a guy named Johan did online, then it was gone, but he had left his sources up for people to play with.

I found a redo of it: http://www.allanime.org/?207
Might be something to look at the name chunks if people want certain genres of names.

Nice project Barm!
11 Nov, 2009, Skol wrote in the 9th comment:
Votes: 0
Found the actual source on wayback machine,
http://web.archive.org/web/2005020720261...

The 'rules' are the files with the name bits, I believe; there are a few versions of the source there as well. Could prove an interesting study.
11 Nov, 2009, Runter wrote in the 10th comment:
Votes: 0
Skol said:
Found the actual source on wayback machine,
http://web.archive.org/web/2005020720261...

The 'rules' are the files with the name bits, I believe; there are a few versions of the source there as well. Could prove an interesting study.


Internet time machine. Fun site. :)
11 Nov, 2009, Barm wrote in the 11th comment:
Votes: 0
I made a change to my regex. Basically I gave up on the letter 'Q'. I had included 'U' in the consonants above because otherwise a 'Q' before it was left alone. Unfortunately, it was matching strings like 'cthulhu' as one leading pattern. Getting cleaner results.

_LEAD = re.compile(r'^[aeiouy]*[bcdfghjklmnprstvwxz]+')
_INNER = re.compile(r'\B[aeiouy]+[bcdfghjklmnprstvwxz]+\B')
_TRAIL = re.compile(r'[aeiouy]+[bcdfghjklmnprstvwxzy]?$')
12 Nov, 2009, Barm wrote in the 12th comment:
Votes: 0
Okay, the ignoring 'Q' thing was bugging me, especially for code that's supposed to be analyzing input. I reworked the expressions to properly handle 'Q' and 'QU'.

_LEAD = re.compile(r'^[aeiouy]*(?:qu|[bcdfghjklmnpqrstvwxz])+')
_INNER = re.compile(r'\B[aeiouy]+(?:qu|[bcdfghjklmnpqrstvwxz])+\B')
_TRAIL = re.compile(r'[aeiouy]+(?:qu|[bcdfghjklmnpqrstvwxz])+$')
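
To see the difference between the three revisions of the leading pattern side by side, the sketch below runs each on 'cthulhu' and 'quennar'. The `LEAD_V1` to `LEAD_V3` names and the `lead` helper are introduced here for illustration only; the patterns themselves are copied from the posts above.

```python
import re

# Original: 'u' doubles as a consonant so that 'qu' stays together
LEAD_V1 = re.compile(r'^[aeiouy]*[bcdfghjklmnpqurstvwxz]+')
# Second try: 'q' and the consonant 'u' dropped entirely
LEAD_V2 = re.compile(r'^[aeiouy]*[bcdfghjklmnprstvwxz]+')
# Current: 'qu' is matched as a unit, a lone 'q' is still a consonant
LEAD_V3 = re.compile(r'^[aeiouy]*(?:qu|[bcdfghjklmnpqrstvwxz])+')


def lead(pattern, word):
    """Return the leading chunk matched by pattern, or None."""
    match = pattern.match(word)
    return match.group(0) if match else None


for word in ('cthulhu', 'quennar'):
    print(word, [lead(p, word) for p in (LEAD_V1, LEAD_V2, LEAD_V3)])
# cthulhu ['cthulhu', 'cth', 'cth']
# quennar ['qu', None, 'qu']
```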


Here's the output seeded with the names of Tolkien elves:

Telodas Orelion Lebriel Elolan Felel Pellel Cas Imelil Orin Mellunion Eraron Nod Non Fod Podir Pir Erorin Ingwingod Amrindas Liel Erir Neliel Cos Erinel Dis Pel Gegoth Oraerebror Amron Tophalil Enolis Lotas Amrod Inglotir Orebroriel Nellir Cingin Ororiel Fellas Eningas Tinaldin Dil Todoth Ingwir Ingwalor Dil Ingwotiel Ingluilod Ingwindis Ingwiel Munarion Tellotod Cos Dophos Erelil Inglophil Eras Fion Gor Lorod Elophis Oris Fir Ingworas Pon Imiron Mion Diel Elil Enaldegan Nin Eron Amriras Gaeringan Din Erotos Lindis Poth Cuiliel Mis Inglos Inglelel Enirion Enoth Tan Gellin Erir Enis Tel Pellas Lion Oron Oruiliel Fatos Cis Fegalel Mebror Ingwatinan Gis Fodion

And Blizzard Orcs:

Zuul Tus Tom Orek Tharar Grok Nuul Ka Krelgruul Kuk Nelmul Huul Krelgrom Sana Ormuul Shek Tus Gar Snan Mom Runan Bo Shelmar San Man Bomuk Ormin Tan Shagag Korgek Tul Grelma Namak Tuk Hamar Thanu Ormelgrak Zak Tamek Shomanan Tan Zarom Thamo Orar Sul Zathama Belgranak Magag Tak Ormelmuk Oro Grak Nu Relgro Zuk Orunin Nan Orunag Tomar Sorar Nus Kus Tar Selma Krandom Orman Zoruk Zu Than Orul Kandok Nagek Tag Melgrus Orar Torgin Thunar Zul Karul Rus Ormar Tek Krorgek Thin Shomuk Krarak Ororag Kunom Zak Zathek Morga Zagek Koru Gus Shorag Mamo Zomuk Belgru Ora Zag
12 Nov, 2009, Skol wrote in the 13th comment:
Votes: 0
Looking good Barm!
I didn't see it do any 'qu' in the output, have you iterated a few to see how it does?
12 Nov, 2009, Barm wrote in the 14th comment:
Votes: 0
Good question, Skol. Turns out that for the 91 Tolkien elves I found names for, only one, Quennar, had a 'Q' in it. It was getting dropped due to rarity. If I turn off culling I get:

Ras Angralmeth Ingwithril Thradhorm Gindund Inglar Amuor Seneth Glaedhrunian Quabloth Thrim Amden Threth Amdahtoth Amraeregir Ingwandimbas Amder Mir Egolfor Glaglen Argorn Olwimrim Gladrien Angrorm Onis Hahtad Enonwor Tegnil Mimin Indung Molfuor Fingen Inglodh Throdrad Onetion Inglaglimbeg Earwadhel Cathod Ecthund Pumas Eris Ropher Elian Eldenwir Eros Bad Idretor Sian Sablodh Mer Hos Irorn Angroroth Elmildengund Gwarforn Deston Gluiliel Indin Glegnerdan Egethas Boran Quodrel Amdien Oneth Cuil Glonwoth Siel Sion Ingweth Aror Huil Amuvimreg Cad Enolfad Polian Din Inglon Eldim Amrahtien Earwurgoth Rodagas Quetheg Oretoth Ingluennoth Badhuor Carwan Dingel Threl Eldirimlorn Earwegnuil Elmatorn Angrumar Gwerdor Renwer Gworn Quodoth Earwien Bung Enathen Muil
12 Nov, 2009, shasarak wrote in the 15th comment:
Votes: 0
I'm not totally convinced by a demon named "Tasty", but otherwise that's very impressive. :smile:
12 Nov, 2009, David Haley wrote in the 16th comment:
Votes: 0
Have you tried using n-grams to generate the names? It works by chopping up inputs into units (probably syllables in this case), and then determining not the frequency of individual units, but the probability of a given unit following another unit. This way, you end up with units more likely to actually seem to fit together.

I have code for this; it's roughly 350 lines of Lua (including whitespace and comments). I ran it on Shakespeare and the WSJ; the units in this case were words, not syllables. Here are some of the more interesting generated sentences:

- An he is old oblivion; I change favours; I am yours; and Warwick was free scope; I warrant. (Shakespeare, bigrams)

- As to run from the day is well he chang'd into the next village of the Emperor, nay, to give it be restrain'd, and enter THESEUS. (Shakespeare, bigrams)

- Regulators, as compiled by Dow Jones's board, had structural damage, including International Business Machines Corp. hardware that uses index arbitrage at Kidder. (WSJ trigrams)

- "According to available details," says a spokeswoman, the changes were prompted by a third of the stadium, damaged by natural disasters – Hurricane Hugo, " May 15) will recall that the company, which are filled at the end of the U.K. (WSJ trigrams)

- The English and attorney who have been performing since the quake won't make this century ago, citing its results anyway. (WSJ bigrams)



I could probably run the code almost as-is if I had an easy way of breaking words into chunks. Maybe I'll even get around to posting the code at some point.
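
The bigram idea described above can be sketched in a few lines of Python. This is not the Lua code mentioned in this post (which isn't included in the thread); the `build_bigrams` and `generate` names, the start/end markers, and the three pre-chunked sample words are all assumptions for illustration.

```python
import random
from collections import defaultdict

START, END = '^', '$'  # hypothetical markers for word boundaries


def build_bigrams(sequences):
    """Map each unit to the list of units seen directly after it.

    Repeats are kept, so a uniform pick from the list follows the
    observed probability of one unit following another.
    """
    table = defaultdict(list)
    for units in sequences:
        chain = [START] + list(units) + [END]
        for prev, nxt in zip(chain, chain[1:]):
            table[prev].append(nxt)
    return table


def generate(table, rng=random, max_units=10):
    """Walk the table from START until END (or max_units) is reached."""
    out = []
    unit = START
    while len(out) < max_units:
        unit = rng.choice(table[unit])
        if unit == END:
            break
        out.append(unit)
    return ''.join(out)


# Words already broken into chunks; how to chunk is a separate problem
samples = [['m', 'ark', 'er'], ['p', 'ewt', 'er'], ['m', 'art', 'in']]
table = build_bigrams(samples)
print(generate(table).title())
```

With only three samples the walk can produce nothing but the inputs themselves; recombination only starts to appear once many words share units.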
12 Nov, 2009, Barm wrote in the 17th comment:
Votes: 0
David Haley said:
Have you tried using n-grams to generate the names? It works by chopping up inputs into units (probably syllables in this case), and then determining not the frequency of individual units, but the probability of a given unit following another unit. This way, you end up with units more likely to actually seem to fit together.


Originally, I was going to try using Markov chains, but the first thing I ran into was: how do I actually break a word into syllables? Take 'pewter' and 'marker' for example. Most people would break them as (pew)(ter) and (mark)(er). No pattern to that. I didn't really want to manually specify or build a speech-synthesis-style dictionary (especially when dealing with fantasy names). That's where I took the easier route of (p)(ewt)(er) and (m)(ark)(er) and built from there.
12 Nov, 2009, David Haley wrote in the 18th comment:
Votes: 0
I'd be astonished if there weren't libraries out there for breaking words into syllables that work "well enough" (most of the time), similar to parts-of-speech tagging (which is some 97% accurate). I agree that you don't really want to be in that business yourself, but even so I imagine you could get a decent approximation without too much work. (Incidentally, I would have grouped those as pew-ter and mar-ker, so there is a pattern of some sort.)
12 Nov, 2009, Barm wrote in the 19th comment:
Votes: 0
Well, I did give up after an exhaustive 20 seconds or so. I may take another crack at the syllable approach.

Edit: Darn it, now you have me re-thinking the whole thing. ;-P
14 Nov, 2009, Koron wrote in the 20th comment:
Votes: 0
David Haley said:
Have you tried using n-grams to generate the names? It works by chopping up inputs into units (probably syllables in this case), and then determining not the frequency of individual units, but the probability of a given unit following another unit. This way, you end up with units more likely to actually seem to fit together.

I have code for this; it's roughly 350 lines of Lua (including whitespace and comments). I ran it on Shakespeare and the WSJ; the units in this case were words, not syllables. Here are some of the more interesting generated sentences:

- An he is old oblivion; I change favours; I am yours; and Warwick was free scope; I warrant. (Shakespeare, bigrams)

- As to run from the day is well he chang'd into the next village of the Emperor, nay, to give it be restrain'd, and enter THESEUS. (Shakespeare, bigrams)

- Regulators, as compiled by Dow Jones's board, had structural damage, including International Business Machines Corp. hardware that uses index arbitrage at Kidder. (WSJ trigrams)

- "According to available details," says a spokeswoman, the changes were prompted by a third of the stadium, damaged by natural disasters – Hurricane Hugo, " May 15) will recall that the company, which are filled at the end of the U.K. (WSJ trigrams)

- The English and attorney who have been performing since the quake won't make this century ago, citing its results anyway. (WSJ bigrams)



I could probably run the code almost as-is if I had an easy way of breaking words into chunks. Maybe I'll even get around to posting the code at some point.

Congratulations, you coded your very own spam bot!