Bubble.ro - because there is always something new to learn about

Convert the AOL user session collection to a MySQL database

 

Category: Programming/PHP

The initial distributed file was AOL-data.tgz (439MB). Inside it contains a simple readme file and 10 .gz archives, each containing a CSV collection. With a packed size of about 45MB each, when extracted they reveal the true treasure: a nice collection of actual user searches (each file at about 200MB).

Although AOL removed the file from their website, it was spread to far too many people already and continues to distribute over the Web or through Torrent systems.

Create the MySQL database

I have only worked with the first file since it's a huge database and quite hard to extract some useful information even from a single file, but the operations are the same for each part.

Why would you want to convert it to an actual database format? Well, if you know the power of SQL this question is very much answered by itself, if not, let's just say you can search, group and order data that you are looking for faster and easier than by other means.

The first step would be to create the database structure:


`aol` table structure
CREATE TABLE `aol` (
 `query_id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
 `AnonID` INT(10) UNSIGNED NOT NULL,
 `Query` VARCHAR(255) NOT NULL,
 `QueryTime` DATETIME NOT NULL,
 `ItemRank` TINYINT(3) UNSIGNED NOT NULL,
 `ClickURL` VARCHAR(255) NOT NULL,
 PRIMARY KEY (`query_id`)
 );
You can add an index on the Query field if you wish, also you can remove the `query_id` if you are looking for high INSERT speed, but I would not recommend it if you wish to browse through results.

The PHP script to parse the file

Next, I've created a small PHP script to parse the plain text file and run queries in order to populate the database.

<?php

set_time_limit (1800); // set the time limit - parsing the file can take several minutes
 
$file = "user-ct-test-collection-01.txt"; // change this for which file you want to parse

 
$link = mysql_connect("localhost", "root", "")
or die("Could not connect");

mysql_select_db("aol") or die("Could not select database.");

 
$fd = fopen ($file, "r");
 
$i = 0;
 

while (!feof ($fd))
{
$buffer = fgets($fd, 4096);
 

$a = explode("\t", $buffer);

if (!is_numeric($a[0])) continue;

// group the queries in packs of 100 (you can change the number)

// - this should improve speed by sending less queries to the database

if (($i%100) == 0)
{
if ($i>0)

{
$result = mysql_query($sq);

$err = mysql_error();

if (!empty($err))

{
echo "Error on line ".$i.": ".$err."
Query: "
.$sq."

"
;
}

}

// the table was called `aol`, you can change this to meet your needs

$sq = "INSERT INTO `aol` ( `AnonID` , `Query` , `QueryTime` , `ItemRank`
, `ClickURL` )
VALUES (
'"
.$a[0]."', '".addslashes($a[1])."', '".$a[2]."', '".$a[3]."',
'"
.$a[4]."'
)"
;
}

else
{
$sq .= ", (
'"
.$a[0]."', '".addslashes($a[1])."', '".$a[2]."', '".$a[3]."',
 '"
.addslashes($a[4])."'
)"
;
}


$i++;

}
 
echo "done.";
 
fclose ($fd);
 
mysql_close($link);

?>
 

The result:

initial file: 212MB

mysql db: 224MB (264MB with index on Query)

3,558,400 records

All you need to do now is Enjoy the results!

Posted by: Indy on August 10, 2006 at 20:53.
 

» Comments

Web Developer
Thanks, this script really saved my ass. I was using single insert queries on a text file and it was causing my script to take a very very long time to run. Grouping multiple inserts into a single query made my insert of 14,000 rows go from 10+ minutes to about 10 seconds!

Posted by Daniel on August 3, 2007 at 06:08 PM.

Thank you for the script.
It was fast evenon my old pc.

Posted by Chris on May 2, 2008 at 12:19 PM.

student
thx man, saved me some time writing this!!

you know any data-mining tools for this?

Posted by Sam on November 18, 2009 at 02:42 PM.

[email protected]
share time new friends.

Posted by nand lal hotwani on February 16, 2010 at 03:14 PM.

Random Article


Search


Feeds


Bubble.ro RSS Feed

All Categories


Articles


Aetolia - The Midnight Age
How to create the histogram of an image using PHP
How to convert an image to grayscale using PHP
How to check if an image is grayscale in PHP
Interchanging 2 variables without the use of a third
Error launching browser window:no XBL binding for browser
Convert the AOL user session collection to a MySQL database
Introduction to Matlab
Creating a customized session handling system in PHP (part II)
Creating a customized session handling system in PHP (part I)
Firefox crashing with Yahoo! Messenger
ADL Search for oDC
Video codecs explained
Browsershots
How to use Auto-Away Message with oDC
Create complete Windows XP disk with SP2 and all updates
Data Execution Prevention error message in Windows XP
Google Mars
Logarithmic scale graphs in Excel
Urban Dictionary (or wtf does l33t mean?)
Learn more about BIOS
Backup your Firefox and Thunderbird settings
Syndicate your Yahoo 360 profile
What is Google PageRank?
'Cannot Open the File: Mk:@MSITStore' Error Message
Get your Gmail with Mozilla Thunderbird
E-Books links
Change the size of your Explorer thumbnails
Remove previews from Windows Explorer
How can I turn off system beeps?
How do I disable Internet Explorer?
What are proxies or how do I protect my anonymity?
How to set aliases triggers or macros in MushClient
What is RSS?
Palm Zire 31 fast review
oDC Installation and Basic Configuration
How I built a 2x80W amplifier (using power modules)
Leech/HotLink Protection
How to block referrer detection?
How to find out your IP address
Getting started with Mushclient
What is spyware and how do I protect my PC from it?
Stumble Upon - random surfing around the web
Automatic file backup for Windows users
How can I read foreign language sites?
Protect your web surfing privacy!
What is BitTorrent?
No more ads! Adblock for Firefox
Why use Firefox instead of Internet Explorer?
How do I create my own Yahoo ID?
© Copyright 2006-2015 Bubble. All rights reserved. Sitemap - Contact