Loading Geonames.org data into SQL Server and SQL Azure or "Why Integers just aren't the right choice for population fields"

For a current project, I’ve loaded country and city data from GeoNames, and since most of the guides out there are outdated, let me briefly summarize how to load a dump (available here – this works for the “allCountries” file as well as for single-country files).

Step 1 - The Schema

The file is tab-delimited and follows the schema described here. The corresponding SQL table script is:

CREATE TABLE [dbo].[GeoNames](
    [geonameid] [int] NOT NULL,
    [name] [nvarchar](200) NULL,
    [asciiname] [nvarchar](200) NULL,
    [alternatenames] [nvarchar](max) NULL,
    [latitude] [float] NULL,
    [longitude] [float] NULL,
    [feature_class] [char](2) NULL,
    [feature_code] [nvarchar](10) NULL,
    [country_code] [char](2) NULL,
    [cc2] [nvarchar](200) NULL,
    [admin1_code] [nvarchar](20) NULL,
    [admin2_code] [nvarchar](80) NULL,
    [admin3_code] [nvarchar](20) NULL,
    [admin4_code] [nvarchar](20) NULL,
    [population] [bigint] NULL,
    [elevation] [int] NULL,
    [dem] [int] NULL,
    [timezone] [char](31) NULL,
    [modification_date] [date] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Some of the schemas I found online had int for the population field, which will fail when importing “World” (yes, that’s a record in the “allCountries” file) or even just “Asia.” Also, the cc2 field (alternative ISO codes for countries) was often too short (60 instead of 200 chars). The schema shown here will import today’s file (2016-07-22) with no errors.
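If you want to see the offending records for yourself, here is a quick check you can run once the import (Step 3) is done – a minimal sketch against the table defined above that lists every record whose population would overflow a 32-bit int:

SELECT [geonameid], [name], [population]
FROM [dbo].[GeoNames]
WHERE [population] > 2147483647   -- int maximum; "World" and "Asia" exceed it
ORDER BY [population] DESC;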

Step 2 - File Conversion

We will be using BULK INSERT to import the file, and it has some specific requirements: UTF-16 encoding with a BOM and \r\n as line endings. It should be possible to convert the file (even at 1.3GB) with a modern text editor, but in my case the resulting file was not readable by BULK INSERT. So the simplest thing I could think of was to write a small conversion utility myself:

using System;
using System.IO;
using System.Text;

const string CITIES_FILE = @"c:\temp\allCountries\allCountries.txt";
const string CITIES_FILE_UTF16 = @"c:\temp\allCountries\allCountries_utf16.txt";
const string LINE_TERMINATOR = "\r\n";

// Encoding.Unicode is UTF-16 LE; the StreamWriter writes the BOM that BULK INSERT expects.
using (StreamWriter sw = new StreamWriter(CITIES_FILE_UTF16, false, Encoding.Unicode))
using (StreamReader sr = new StreamReader(CITIES_FILE))
{
    sw.AutoFlush = false;
    string line;
    // Copy every line and terminate it explicitly with \r\n.
    while ((line = sr.ReadLine()) != null)
    {
        sw.Write(line);
        sw.Write(LINE_TERMINATOR);
    }
}

Console.WriteLine("DONE - Press enter to continue...");
Console.ReadLine();

This compiles as a plain console application in Visual Studio. It’s very basic: read the original file line by line and write it out as UTF-16 (note: Encoding.Unicode is UTF-16 LE) with the correct line endings. Sure, it’s not optimized for performance, but on my machine this took less than a minute for the roughly 11 million rows.

Step 3 - Bulk Import

Finally, we can use the BULK INSERT command in SQL Server to load the data into our table:

BULK INSERT GeoNames
FROM 'C:\temp\allCountries\allCountries_utf16.txt'
WITH (
    DATAFILETYPE = 'widechar',
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
)
GO

Interestingly, the ROWTERMINATOR still has to be defined as \n even though the file uses \r\n. The reason is that SQL Server automatically prefixes a \n row terminator with \r during bulk import, so specifying \n here effectively means \r\n.
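If the load turns out to be slow or the transaction log grows too much, BULK INSERT also accepts BATCHSIZE and TABLOCK options; a variant of the statement above (same file and table) might look like this:

BULK INSERT GeoNames
FROM 'C:\temp\allCountries\allCountries_utf16.txt'
WITH (
    DATAFILETYPE = 'widechar',
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 100000,   -- commit in batches instead of one huge transaction
    TABLOCK               -- table lock; can enable minimally logged inserts
)
GO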

Now you have about 11 million rows of country, city, zone, and region data in your database. Next up is the challenge of getting about 1GB of data into SQL Azure efficiently.

Step 4 - Load into SQL Azure

Create the table in SQL Azure using the schema above, without any indexes (if you are planning on using Full Text Search, do all the required steps for it AFTER importing the raw data into SQL Azure).
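As an example of what such a post-import step could look like: once the bcp import below has finished, a typical first move is adding the primary key on geonameid. A minimal sketch (the constraint name PK_GeoNames is my own choice):

-- Run AFTER the bcp import has completed; creating the key up front only slows the load down.
ALTER TABLE [dbo].[GeoNames]
    ADD CONSTRAINT [PK_GeoNames] PRIMARY KEY CLUSTERED ([geonameid]);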

Using the bcp (bulk copy) command-line tool that is installed with the SQL Server client tools (you get it along with SQL Server Management Studio), this is a two-step process. First, export the data from your local database to a file (-T uses a trusted Windows connection, -c writes the data in character format; check the other options with bcp /?):

bcp [YOURDB].[dbo].[GeoNames] out "geonames.txt" -T -c -S localhost

Then, import the generated file to SQL Azure. Use a batch size of about 100K to avoid filling the transaction log on the destination server:

bcp [YOURDB].[dbo].[GeoNames] in "geonames.txt" -U YOURUSER@YOURDATABASE -P YOURPASSWORD -S YOURDATABASE.database.windows.net -c -b 100000

This process took a while on my computer, but you will see the progress for each batch and can estimate the total time on your machine.
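Once bcp has finished, a quick sanity check against the SQL Azure database (same table name as above) confirms that all rows made it across:

-- Should roughly match the local row count (about 11 million rows for allCountries).
SELECT COUNT(*) AS imported_rows
FROM [dbo].[GeoNames];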