Loading Geonames.org data into SQL Server and SQL Azure or “Why Integers just aren’t the right choice for population fields”

For a current project I’ve loaded country and city data from GeoNames, and since most of the guides I found are outdated, let me briefly summarize how to import a dump (available here – works for the “allCountries” file as well as for single-country files).

Step 1 – the schema

The file is tab-delimited and follows the schema described here. The corresponding SQL table script is:

CREATE TABLE [dbo].[GeoNames](
	[geonameid] [int] NOT NULL,
	[name] [nvarchar](200) NULL,
	[asciiname] [nvarchar](200) NULL,
	[alternatenames] [nvarchar](max) NULL,
	[latitude] [float] NULL,
	[longitude] [float] NULL,
	[feature_class] [char](2) NULL,
	[feature_code] [nvarchar](10) NULL,
	[country_code] [char](3) NULL,
	[cc2] [char](200) NULL,
	[admin1_code] [nvarchar](20) NULL,
	[admin2_code] [nvarchar](80) NULL,
	[admin3_code] [nvarchar](20) NULL,
	[admin4_code] [nvarchar](20) NULL,
	[population] [bigint] NULL,
	[elevation] [int] NULL,
	[dem] [int] NULL,
	[timezone] [char](31) NULL,
	[modification_date] [date] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Some of the schemas I found online used int for the population field, which fails when importing “World” (yes, that’s a record in the “allCountries” file) or even just “Asia” – both exceed int’s maximum of 2,147,483,647. The “cc2” field (alternative ISO codes for countries) was also often too short (60 instead of 200 characters). The schema shown here imports today’s file (2016-07-22) with no errors.
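If you’re curious which records would have overflowed an int, you can check after the import. The query below is just an illustration against the table defined above; 2,147,483,647 is int’s maximum value:

```sql
-- Rows whose population would not fit into a 32-bit int
SELECT geonameid, name, population
FROM dbo.GeoNames
WHERE population > 2147483647;
```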

Step 2 – file conversion

We will use BULK INSERT to import the file, and it has some specific requirements: UTF-16 encoding with a BOM, and \r\n as line endings. It should be possible to convert the file (even at 1.3 GB) with a modern text editor, but for some reason the output file was not readable by BULK INSERT. So the simplest thing I could think of was to write a small utility myself.

using System;
using System.IO;
using System.Text;

class Program
{
    const string CITIES_FILE = @"c:\temp\allCountries\allCountries.txt";
    const string CITIES_FILE_UTF16 = @"c:\temp\allCountries\allCountries_utf16.txt";
    const string LINE_TERMINATOR = "\r\n";

    static void Main()
    {
        // Encoding.Unicode writes UTF-16 LE with a BOM, which BULK INSERT ('widechar') expects.
        using (StreamWriter sw = new StreamWriter(CITIES_FILE_UTF16, false, Encoding.Unicode))
        using (StreamReader sr = new StreamReader(CITIES_FILE))
        {
            sw.AutoFlush = false;
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                sw.Write(line);
                sw.Write(LINE_TERMINATOR); // normalize every line ending to \r\n
            }
        }
        Console.WriteLine("DONE - Press enter to continue...");
        Console.ReadLine();
    }
}

This compiles as a plain console application in any Visual Studio version. It’s very basic: read the original line by line and write it back out as UTF-16 (Encoding.Unicode is UTF-16 LE with a BOM) with the correct line endings. Sure, it’s not optimized for performance, but on my machine it took less than a minute for 11 million rows.

Step 3 – bulk import

Finally we can use the BULK INSERT command in SQL Server to load the data into our table.

BULK INSERT GeoNames
FROM 'C:\temp\allCountries\allCountries_utf16.txt'
WITH (
    DATAFILETYPE = 'widechar',
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
)
GO

Interestingly, ROWTERMINATOR still has to be specified as \n even though the file uses \r\n. The reason is that when you specify \n as the row terminator for a bulk import, SQL Server implicitly expects \r\n, so \n is actually the correct setting here.

Now you have about 11 million rows of country, city, and region data in your database. Next up is the challenge of getting about 1 GB of data into SQL Azure efficiently.

Step 4 – load into SQL Azure

Create the schema table without any indexes (if you are planning to use Full Text Search, do all the required steps for it AFTER importing the raw data into SQL Azure).
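Once the raw data is loaded, you can add your indexes. As an example, a clustered primary key on geonameid (the GeoNames unique id) could look like this – a minimal sketch, the constraint name is a placeholder:

```sql
-- Run only AFTER the bulk load has finished
ALTER TABLE dbo.GeoNames
ADD CONSTRAINT PK_GeoNames PRIMARY KEY CLUSTERED (geonameid);
```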

Using the bcp (bulk copy) tool installed alongside SQL Server Management Studio, this is a two-step process. First, export the data from your local database to a file (-T uses a trusted connection, -c specifies character type; check the other options with “bcp /?”):

bcp [YOURDB].[dbo].[Geonames] out "geonames.txt" -T -c -S localhost

Then import the generated file into SQL Azure. Use a batch size of about 100,000 rows so you don’t fill the transaction log on the destination server.

bcp [YOURDB].[dbo].[Geonames] in "geonames.txt" -U YOURUSER@YOURDATABASE -P YOURPASSWORD -S YOURDATABASE.database.windows.net -c -b 100000

This process took a while on my computer, but bcp reports progress for each batch, so you can estimate the total time on your machine.
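To verify the transfer, a simple row-count comparison between source and destination is usually enough – run this against both servers and compare:

```sql
-- Should return the same count locally and on SQL Azure
SELECT COUNT(*) AS row_count FROM dbo.GeoNames;
```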
