For a current project, I’ve loaded country and city data from GeoNames, and since most of the guides out there are outdated, let me briefly summarize how to get a dump into SQL Server (the files are available here; this works for the “allCountries” file as well as for single countries).
Step 1 - The Schema
The file is tab-delimited and follows the schema described here. The corresponding SQL table script is:
CREATE TABLE [dbo].[GeoNames](
[geonameid] [int] NOT NULL,
[name] [nvarchar](200) NULL,
[asciiname] [nvarchar](200) NULL,
[alternatenames] [nvarchar](max) NULL,
[latitude] [float] NULL,
[longitude] [float] NULL,
[feature_class] [char](2) NULL,
[feature_code] [nvarchar](10) NULL,
[country_code] [char](2) NULL,
[cc2] [nvarchar](200) NULL,
[admin1_code] [nvarchar](20) NULL,
[admin2_code] [nvarchar](80) NULL,
[admin3_code] [nvarchar](20) NULL,
[admin4_code] [nvarchar](20) NULL,
[population] [bigint] NULL,
[elevation] [int] NULL,
[dem] [int] NULL,
[timezone] [char](31) NULL,
[modification_date] [date] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Some of the schemas I found online had int for the population field, which will fail when importing “World” (yes, that’s a record in the “allCountries” file) or even just “Asia.” Also, the cc2 field (alternative ISO codes for countries) was often too short (60 instead of 200 chars). The schema shown here will import today’s file (2016-07-22) with no errors.
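Once the data is loaded (Step 3 below), a quick query makes the bigint requirement obvious. This is just a sketch against the table defined above:

-- The top entries ("World", the continents) have populations well above
-- the int maximum of 2,147,483,647, which is why [population] must be bigint.
SELECT TOP (5) [name], [population]
FROM [dbo].[GeoNames]
ORDER BY [population] DESC;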
Step 2 - File Conversion
We will be using Bulk Import for importing the file, and it has some specific requirements: UTF-16 encoding with BOM and \r\n as line endings. It should be possible to convert the file (even at 1.3 GB) with a modern text editor, but somehow the output file was not readable by Bulk Import. So the simplest thing I could think of was to write a small utility myself:
using System;
using System.IO;
using System.Text;

const string CITIES_FILE = @"c:\temp\allCountries\allCountries.txt";
const string CITIES_FILE_UTF16 = @"c:\temp\allCountries\allCountries_utf16.txt";
const string LINE_TERMINATOR = "\r\n";

// Rewrite the file as UTF-16 LE with BOM (Encoding.Unicode) and \r\n line endings.
using (StreamWriter sw = new StreamWriter(CITIES_FILE_UTF16, false, Encoding.Unicode))
using (StreamReader sr = new StreamReader(CITIES_FILE))
{
    string line;
    sw.AutoFlush = false;
    while ((line = sr.ReadLine()) != null)
    {
        sw.Write(line);
        sw.Write(LINE_TERMINATOR);
    }
}

Console.WriteLine("DONE - Press enter to continue...");
Console.ReadLine();
This compiles as a plain console application in any Visual Studio version. It’s very basic: read the original file and write it back out in UTF-16 (note: Encoding.Unicode is UTF-16) with the correct line endings. Sure, it’s not optimized for performance, but on my machine this took less than a minute for the roughly 11 million rows.
Step 3 - Bulk Import
Finally, we can use the BULK INSERT command in SQL Server to load the data into our table:
BULK INSERT GeoNames
FROM 'C:\temp\allCountries\allCountries_utf16.txt'
WITH (
DATAFILETYPE = 'widechar',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
GO
Interestingly, the ROWTERMINATOR still has to be defined as \n even though \r\n is expected (and the file contains \r\n). This appears to be by design: the SQL Server documentation notes that when \n is specified as the row terminator for a bulk import, it is implicitly treated as \r\n.
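As a quick sanity check after the import (again, just a sketch using the table above), you can compare the imported row count against the line count of the converted file:

-- Should roughly match the number of lines in allCountries_utf16.txt
-- (about 11 million at the time of writing).
SELECT COUNT(*) AS [imported_rows]
FROM [dbo].[GeoNames];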
Now you have about 11 million rows of country, city, zone, and region data in your database. Next up is the challenge of getting about 1GB of data into SQL Azure efficiently.
Step 4 - Load into SQL Azure
Create the table in SQL Azure using the same schema, but without any indexes (if you are planning to use Full Text Search, do all of the required steps for it AFTER importing the raw data into SQL Azure).
Using the bcp (bulk copy) tool that comes with SQL Server Management Studio, this is a two-step process. First, export the data from your local database to a file (-T uses a trusted connection, -c specifies character data type; check the other options with bcp /?):
bcp [YOURDB].[dbo].[GeoNames] out "geonames.txt" -T -c -S localhost
Then, import the generated file to SQL Azure. Use a batch size of about 100K to avoid filling the transaction log on the destination server:
bcp [YOURDB].[dbo].[GeoNames] in "geonames.txt" -U YOURUSER@YOURDATABASE -P YOURPASSWORD -S YOURDATABASE.database.windows.net -c -b 100000
This process took a while on my computer, but you will see the progress for each batch and can estimate the total time on your machine.
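Once all batches are in, this is the point to add the indexes that were deliberately left out before the import. A minimal sketch (the key and index names here are just examples, not part of the original setup):

-- Example only: a clustered primary key on geonameid after the data is loaded.
ALTER TABLE [dbo].[GeoNames]
ADD CONSTRAINT [PK_GeoNames] PRIMARY KEY CLUSTERED ([geonameid]);

-- Example only: a nonclustered index to speed up name lookups.
CREATE NONCLUSTERED INDEX [IX_GeoNames_Name]
ON [dbo].[GeoNames] ([name]);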