SkyDrive: Why aren’t thou nicer to me? or “How to crawl pages and download all images using C#”

OK, SkyDrive is still a baby, and personally I’ve used other cloud file storage services and enjoyed them a little better. But this post isn’t about whether SkyDrive is good or bad; it’s about one missing feature that is very painful. Someone wanted to share some photos, uploaded them to SkyDrive, and all I wanted was to download them all to my PC. Tough luck: you have to click on each and every image to get to the preview page, where you click on the preview picture to finally get at the actual picture. Multiply that by about 100. I have better things to do than waste my time on that.

So a dev does what he does best: fires up Visual Studio 2008 and hacks away (did I just say I had something better to do? Well, I lied, partially, but before I go off to do that, there is always time for some good ol’ C#).

I’ve posted it here not as a finished utility (there are no binaries) but as a small sample. Using WebClient, Regex and a few other classes, it downloads the list page of the SkyDrive folder, follows each link to an image’s preview page, and then downloads the actual image to a folder on the hard disk. It’s not rocket science, and of course there are a few quirks (no real error handling, for example), but it’s just a sample. Feel free to extend it as you wish, but don’t blame me if it starts downloading gigabytes of files overnight because you accidentally crawled a honeypot. (And yes, it only downloads JPGs at the moment. I didn’t need any other types.)
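To try the link-finding step on its own, without touching the network, here is a minimal sketch that runs against an HTML string in memory. It is not the code from the utility: the class name `LinkExtractionSketch`, the method `ExtractJpgLinks` and the simplified pattern (double-quoted `href` attributes only) are made up for illustration, and it assumes .NET 4.0 or later so that `WebUtility.HtmlDecode` is available without a reference to System.Web.dll.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only (not part of the utility).
public static class LinkExtractionSketch
{
    // Simplified pattern: matches only double-quoted href attributes.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"(?<url>[^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns the absolute URLs of all distinct .jpg links found in the HTML.
    public static List<string> ExtractJpgLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            // Attribute values are HTML-encoded (e.g. &amp;), so decode first.
            string raw = WebUtility.HtmlDecode(match.Groups["url"].Value);

            // Resolve relative links against the page URL; skip unusable ones.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, raw, out absolute))
                continue;

            string url = absolute.AbsoluteUri;
            if (url.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase) && !links.Contains(url))
                links.Add(url);
        }
        return links;
    }
}
```

Calling `LinkExtractionSketch.ExtractJpgLinks(pageHtml, pageUri)` with the downloaded page text and its URL gives back the list of candidate image links, which is easier to debug than a pattern buried inside the crawl loop.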

May those SkyDrive bytes be with you…

/**********************************************************************************
*
* Example Application for crawling web pages and downloading images.
*
* This code works if you pass in a SkyDrive Folder Url (http://…. /browse.aspx/…)
* and will download any jpg images it finds in there.
*
* Permission to use, copy, modify, distribute and sell this software and its
* documentation for any purpose is hereby granted without fee.
* I make no representations about the suitability of this software for any purpose.
* It is provided “as is” without express or implied warranty.
*
* Alex Duggleby – 24.05.08 – V0.9 – http://alexduggleby.com
*
**********************************************************************************/
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
using System.IO;
using System.Web; // HttpUtility; needs a reference to System.Web.dll
using System.ComponentModel;

namespace Tools.SkyDrive.DownloadAll
{
    class Program
    {
        // Used for tracking how many items we have left
        private static int _wcInnerCount = 0;
        private static int _wcInnerCompleted = 0;

        // We have to start somewhere
        private static Uri _uriStart;

        // Work we have already done
        private readonly static List<string> _urisCrawled = new List<string>();
        private readonly static List<string> _imagesDownloaded = new List<string>();

        // Download images to?
        private readonly static DirectoryInfo _diDownloadTo = new DirectoryInfo(Path.Combine(Path.Combine(System.Environment.GetFolderPath(Environment.SpecialFolder.Personal), "Downloads"), "Images"));

        // This finds urls in the page (captured in the named group "url")
        private readonly static Regex _regexUrl = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");

        // This finds the open url in the image page
        private readonly static Regex _regexUrlOpen = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*)) title=\\\"Open\\\"");

        /// <summary>
        /// Takes the url to a skydrive folder page and downloads all jpg images.
        /// </summary>
        static void Main(string[] args)
        {
            // Usage check
            if (args.Length != 1)
            {
                Console.WriteLine("Usage: App.exe http://theUrlToThe/SkyDrive/FolderPage");
                return;
            }

            try
            {
                // First parameter is url
                _uriStart = new Uri(args[0]);
            }
            catch (Exception _ex)
            {
                Console.WriteLine("Invalid Url. " + _ex.Message);
                return;
            }

            // Make sure download directory exists
            if (!_diDownloadTo.Exists) _diDownloadTo.Create();

            using (WebClient _wc = new WebClient())
            {
                // This is the index with all the images
                string _pageContents = _wc.DownloadString(_uriStart);

                // Each image has a preview page, so we get the url to that, before we get the url to the actual image
                foreach (Match _matchUrlToImagePage
                    in _regexUrl.Matches(_pageContents))
                {
                    Uri _uriToImagePage =
                        new Uri(_uriStart, HttpUtility.HtmlDecode(_matchUrlToImagePage.Groups["url"].Value));

                    CrawlPreviewPage(_uriToImagePage);
                }
            }

            // Wait for the async web clients to complete…
            while (_wcInnerCompleted < _wcInnerCount)
            {
                Console.WriteLine("Wait for images to complete...");
                Console.ReadLine();
            }

            Console.WriteLine("Should be finished!");
            Console.ReadLine();
        }

        /// <summary>
        /// Parses the preview page and finds the actual image link
        /// </summary>
        /// <param name="uriToImagePage">The url to the preview page</param>
        private static void CrawlPreviewPage(Uri uriToImagePage)
        {
            using (WebClient _wc = new WebClient())
            {
                if (!_urisCrawled.Contains(uriToImagePage.ToString()))
                {
                    _urisCrawled.Add(uriToImagePage.ToString());

                    if (uriToImagePage.ToString().ToLower().EndsWith(".jpg"))
                    {
                        string _pageContents = _wc.DownloadString(uriToImagePage);

                        // Find the image we want to download… There should be
                        // only one link with title="Open" in it.
                        foreach (Match _matchImage in _regexUrlOpen.Matches(_pageContents))
                        {
                            Uri _uriToImage = new Uri(_matchImage.Groups["url"].Value);

                            DownloadImage(_uriToImage);
                        }
                    }
                }
            }
        }

        /// <summary>
        /// Asynchronously downloads an image from a Uri
        /// </summary>
        /// <param name="uriToImage">The uri to download</param>
        private static void DownloadImage(Uri uriToImage)
        {
            // Output the url
            Console.WriteLine("{0}{1}", uriToImage.ToString(), Environment.NewLine);

            if (!_imagesDownloaded.Contains(uriToImage.ToString()))
            {
                _imagesDownloaded.Add(uriToImage.ToString());
                string _lowerUrl = uriToImage.ToString().ToLower();

                // Simple checking
                if (_lowerUrl.EndsWith(".jpg") &&
                    (!_lowerUrl.Contains("browse")) &&
                    (!_lowerUrl.Contains("self")))
                {
                    // HtmlDecode here because some urls have encoded characters
                    string _localFilename = HttpUtility.HtmlDecode(
                        uriToImage.Segments[uriToImage.Segments.Length - 1]);

                    // Create a valid local filename. GetInvalidFileNameChars is used
                    // because GetInvalidPathChars still allows characters such as
                    // ':' and '?' that are not valid in a file name.
                    foreach (char c in Path.GetInvalidFileNameChars())
                        _localFilename = _localFilename.Replace(c, '_');

                    Console.Write("Downloading {0}...{1}", _localFilename, Environment.NewLine);

                    // Create a separate web client for each image (uses async, and you can't
                    // issue two downloads at the same time for the same client). Of course
                    // here we should be using some kind of pooling but this is the quickest
                    // way to do it.
                    // No using block here: the download is asynchronous, so the client
                    // has to stay alive until DownloadFileCompleted has fired.
                    WebClient _wcInner = new WebClient();
                    _wcInnerCount++;

                    // Subscribe before starting the download so the event cannot be missed
                    _wcInner.DownloadFileCompleted += new AsyncCompletedEventHandler(_wcInner_DownloadFileCompleted);
                    _wcInner.DownloadFileCompleted += delegate { _wcInner.Dispose(); };
                    _wcInner.DownloadFileAsync(uriToImage, Path.Combine(_diDownloadTo.ToString(), _localFilename));
                }
            }
        }

        // Fired when a download completes. We output status and check if we are finished!
        private static void _wcInner_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
        {
            // Increase the completed counter
            _wcInnerCompleted++;

            // Ok, we could do some more extensive checking, this could trigger
            // even if there are still items to download… but hey, it's just a
            // quick utility!
            if (_wcInnerCompleted == _wcInnerCount)
            {
                Console.WriteLine("{0}{1}{2}", Environment.NewLine, "Finished all files!", Environment.NewLine);
                Console.ReadLine();
            }
            else
            {
                Console.WriteLine("File {0} of {1} completed!", _wcInnerCompleted, _wcInnerCount);
            }
        }
    }
}
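
As the comments in the listing admit, comparing two counters can report "finished" while items are still being queued, and the wait loop needs you to press Enter to poll. One possible extension, sketched here under assumptions, is to let a `CountdownEvent` do the bookkeeping. That class needs .NET 4.0 or later, so it is not available in the Visual Studio 2008 / .NET 3.5 setup used above, and the wrapper class `PendingDownloads` with its method names is made up for this sketch.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only (not part of the utility).
public sealed class PendingDownloads
{
    // Starts at 1: that extra count is held by the code that is still
    // queuing downloads, so the event cannot reach zero too early.
    private readonly CountdownEvent _pending = new CountdownEvent(1);

    // Call just before starting an asynchronous download.
    public void Started()
    {
        _pending.AddCount();
    }

    // Call from the DownloadFileCompleted handler.
    public void Completed()
    {
        _pending.Signal();
    }

    // Call once after all downloads have been queued: releases the
    // initial count and blocks until every download has completed.
    public void WaitForAll()
    {
        _pending.Signal();
        _pending.Wait();
    }
}
```

In the utility this would mean calling `Started()` where `_wcInnerCount++` is now, `Completed()` in the completion handler, and `WaitForAll()` in place of the wait loop at the end of `Main`.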


6 thoughts on “SkyDrive: Why aren’t thou nicer to me? or “How to crawl pages and download all images using C#”

  1. Nice! I made a huge mistake of uploading all my photos into skydrive, I mean, why not? 5 gigs of storage. I thought also, once they are up there, I can organize them very easily…nope… So long story short, I got to find a way to get these image files off of skydrive…bleh..

  2. Sounds good, but I have no clue how to use this (I’m just a PC user, not a developer). Any more details to share? Thanks, Wil

  3. just found out you can download the folder(s) as a zipfile from the skydrive… so, no need to tell me anymore ;-)

  4. Hey Wil,

    I’m trying to download the folder as a zipfile but can’t find where or how.
    Can you help ?

    Thx
    Marco

  5. I’m trying to compile the code and am running into the following errors:

    C:\tools\test>csc skydrove1.cs
    Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.1
    for Microsoft (R) .NET Framework version 3.5
    Copyright (C) Microsoft Corporation. All rights reserved.

    skydrove1.cs(141,1): error CS0116: A namespace does not directly contain members
    such as fields or methods
    skydrove1.cs(166,28): error CS1518: Expected class, delegate, enum, interface,
    or struct
    skydrove1.cs(170,39): error CS1518: Expected class, delegate, enum, interface,
    or struct
    skydrove1.cs(177,16): error CS1518: Expected class, delegate, enum, interface,
    or struct

    I’m using csc. Have you compiled the above code? I would like to use it to download zip/rar/7z files from SkyDrive.
