The Fuse Spreadsheet Corpus

This page contains information necessary to reproduce Fuse, a corpus of 2,127,284 spreadsheet URLs (and HTTP responses), and 249,376 unique spreadsheets. The current Fuse set consists of Common Crawl files from Winter 2013 through December 2014.

Downloading the Metadata

The metadata is available as schema-free JSON records. The full set of JSON records is useful when you are interested in duplicates, for example, to understand how a certain file propagates across the Internet. The deduped JSON records are typically more appropriate when you are interested only in the contents of the files themselves and not their origins, because the origin information in a deduped record is selected from an arbitrary JSON record among its duplicates.

Fuse Survey

Please complete our short, two-question Fuse Dataset Usage survey. Your responses will help us maintain funding for Fuse-related research.

Citing Fuse

Fuse is an academic work. To cite Fuse, refer to the corresponding Mining Software Repositories paper:

Titus Barik, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill. "Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets." In: Proceedings of the 12th Working Conference on Mining Software Repositories (Data Showcase), Florence, Italy, 2015.

Errata: The paper states, "We also discovered that =SUM(R[-3]C:R[-1]C) is the most common formula, in which a cell is the sum of the three cells to its left." This formula is actually the sum of the three cells above the formula cell, not to its left.
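
To make the erratum concrete: in R1C1 notation, a bracketed number after R is a row offset relative to the formula cell (negative means above), and a bare C means the same column. The following is a minimal sketch that decodes such relative references; the function name r1c1_offset is hypothetical and is not part of the Fuse source code.

```python
import re

# Matches one relative R1C1 cell reference such as R[-3]C or RC[2].
_REF = re.compile(r"R(?:\[(-?\d+)\])?C(?:\[(-?\d+)\])?")


def r1c1_offset(ref: str) -> tuple:
    """Return (row_offset, col_offset) for a relative R1C1 reference."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a relative R1C1 reference: {ref!r}")
    row, col = match.groups()
    # A missing bracket means an offset of zero on that axis.
    return (int(row or 0), int(col or 0))
```

For R[-3]C and R[-1]C the column offset is zero and the row offsets are negative, so the range covers cells above the formula cell. A range to the left would instead be written RC[-3]:RC[-1].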

Importing into MongoDB

The JSON metadata consists of a single file, with one JSON record per line. This format is suitable for direct import into MongoDB, and it also works with any tool that can handle one JSON record per line.

mongoimport --db test --collection web --file web.analysis.json
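
If you prefer not to use MongoDB, the one-record-per-line format can be streamed directly. The following is a minimal sketch, assuming the file is UTF-8 encoded; the helper name iter_records is hypothetical.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line of the file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, sum(1 for _ in iter_records("web.analysis.json")) counts the records without loading the whole file into memory.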

Downloading the Binaries

We have provided all of the deduped binary spreadsheets in a single, compressed archive, fuse-binaries-dec2014.tar.gz (6.9 GB, sha1:1b6433440b60113d01fa0a93eb9b5d6f35706b45).
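
After downloading, it is worth checking the archive against the SHA-1 checksum given above. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file is read in chunks because the archive is several gigabytes.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum published on this page for fuse-binaries-dec2014.tar.gz.
EXPECTED_SHA1 = "1b6433440b60113d01fa0a93eb9b5d6f35706b45"


def archive_is_intact(path: str, expected: str = EXPECTED_SHA1) -> bool:
    """Compare the computed digest against the expected one."""
    return sha1_of_file(path) == expected.lower()
```

Calling archive_is_intact("fuse-binaries-dec2014.tar.gz") should return True for an uncorrupted download. The same helper can be pointed at the dedup map file by passing its published checksum as the expected value.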

Downloading the Source Code

The source code to Fuse is available on GitHub at Spreadsheet-Common-Crawler. Our source code is published under a BSD license.

Expand Dedup Set of Spreadsheets into Full Set

If you really need to, you can take the deduped binary spreadsheets and expand them back into the full set. The first column of the fuse-dedup.map-dec2014.txt (51 MB, sha1:96ecb17f933d132b1b5fe59dee75180a6d6ad91e) file contains the WARC-Record-ID of the original spreadsheet. The second column contains the equivalent deduped WARC-Record-ID. Thus, for each line of the map, simply copy the deduped file to a file named after the original record. There are any number of ways to do this:

# Assumes the map file is saved as cc.dedup.map.txt and that src/ and dst/ exist.
cat cc.dedup.map.txt | \
awk '{print "src/" $2 "\n" "dst/" $1 "\n"}' | \
xargs -n 2 cp

This process produces 719,223 binaries from the 249,376 deduped binaries.
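
The same expansion can be sketched in Python, which avoids spawning one cp process per file. This is an illustration, not the authors' tooling: the function name expand_dedup_set is hypothetical, and it assumes (as the awk command does) that the columns are whitespace-separated and that each file is named exactly by its WARC-Record-ID.

```python
import os
import shutil


def expand_dedup_set(map_path: str, src_dir: str, dst_dir: str) -> int:
    """Copy each deduped file to every original ID that maps to it.

    Returns the number of files written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    with open(map_path, "r", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # Column 1 is the original ID, column 2 the deduped ID.
            original_id, dedup_id = fields[0], fields[1]
            shutil.copyfile(
                os.path.join(src_dir, dedup_id),
                os.path.join(dst_dir, original_id),
            )
            copied += 1
    return copied
```

For example, expand_dedup_set("cc.dedup.map.txt", "src", "dst") mirrors the shell pipeline and returns the number of files written, which can be compared against the expected total.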

EMR Parameters

We used the following parameters for EMR:

Match and Extract

-m, mapred.map.tasks.speculative.execution=false, 
-m, mapred.reduce.tasks=0, 
-c, fs.s3n.ssl.enabled=false, 
-m, mapreduce.map.java.opts=-Xmx4096m, 
-m, mapreduce.map.memory.mb=4096, 
-m, io.file.buffer.size=65536, 
-m, mapreduce.task.timeout=1200000, 
-y, yarn.scheduler.maximum-allocation-mb=4096, 
-y, yarn.nodemanager.resource.cpu-vcores=1, 
-y, yarn.scheduler.minimum-allocation-mb=4096

Filter, Plugin, and Merge

-m, mapred.map.tasks.speculative.execution=false, 
-m, mapred.reduce.tasks=0, 
-c, fs.s3n.ssl.enabled=false, 
-m, io.file.buffer.size=65536, 
-m, mapreduce.task.timeout=1200000

The important thing is to disable speculative execution, because our map functions are not idempotent. In other words, our maps have side effects because they write to S3, so a speculative duplicate of a task would repeat those writes.

Reduce is inapplicable to our pipeline, which is why mapred.reduce.tasks is set to 0.

Theoretically, the number of instances can scale without limit, since our technique is embarrassingly parallel. In practice, S3 rate limits are the critical bottleneck. For more information, see Request Rate and Performance Considerations. S3 requests must also be fault-tolerant, as they are not guaranteed to succeed. Finally, S3 uses an eventual consistency model, so files may not appear immediately.
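
One common way to make such requests fault-tolerant is to retry with exponential backoff. The following is a generic sketch, not the approach used in the Fuse source code: with_retries and its parameters are hypothetical, the operation is any zero-argument callable that wraps a request, and catching every exception is a simplification that real code would narrow to transient errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

The same helper can also cover eventual consistency: wrap a read that raises when the file is not yet visible, and the backoff gives the file time to appear.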

Example JSON Record with Plugins

Consider an example record, such as the one with WARC-Record-ID 000021ae-58b0-45de-9c1d-92a1d35f07df. You can obtain this record with the following MongoDB query:

db.sdedup.find({'WARC-Record-ID': 
  '<urn:uuid:000021ae-58b0-45de-9c1d-92a1d35f07df>'})


For questions, contact Titus Barik at tbarik@ncsu.edu.
Last updated: September 7, 2015.