This page contains information necessary to reproduce Fuse, a corpus of 2,127,284 spreadsheet URLs (and HTTP responses), and 249,376 unique spreadsheets. The current Fuse set consists of Common Crawl files from Winter 2013 through December 2014.
The metadata is available as schema-free JSON records. The full set of JSON records is useful when you are interested in duplicates, for example to understand how a certain file propagates across the Internet. The deduped JSON records are typically more appropriate when you are interested only in the contents of the files themselves, not their origins, because the origin information for each deduped file is taken from an arbitrary one of its JSON records.
(sha1:76781675bb4603b30fa636130a979853018545c7)
(sha1:b3caee7c39ce8e255028a50ebf8dfa14b7092091)
(sha1:50f880258ac2ebbd407e3f718e57ac2ab23ca03e)
(sha1:9aea74288ce99129282d8dfed78aeed11dc911ee)
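Each download on this page is published with a SHA-1 digest. The following is a minimal sketch of checking a downloaded file against its published digest, using only Python's standard library. The function names are illustrative assumptions and are not part of the Fuse source code.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha1(path, published):
    """Compare a file's digest with a value written as 'sha1:<hex>' or '<hex>'."""
    expected = published.strip().lower()
    if expected.startswith("sha1:"):
        expected = expected[len("sha1:"):]
    return sha1_of_file(path) == expected
```

For example, `matches_published_sha1("fuse-dedup.map-dec2014.txt", "sha1:96ecb17f933d132b1b5fe59dee75180a6d6ad91e")` should return `True` for an intact copy of the map file described further down this page.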
Please complete our short, two-question Fuse Dataset Usage survey. Your responses will help us maintain funding for Fuse-related research.
Fuse is an academic work. To cite Fuse, refer to the corresponding Mining Software Repositories paper:
Titus Barik, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill. "Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets." In: Proceedings of the 12th Working Conference on Mining Software Repositories (Data Showcase), Florence, Italy, 2015.
Errata: "We also discovered that =SUM(R[-3]C:R[-1]C) is the most common formula, in which a cell is the sum of the three cells to its left." This formula is actually the sum of the three cells above the formula cell, not to its left.
The JSON metadata consists of a single file with one JSON record per line. This format is suitable for direct import into MongoDB, and it also works with any tool that can handle one JSON record per line.
mongoimport --db test --collection web --file web.analysis.json
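Because the metadata is one JSON record per line, it can also be processed without MongoDB. The sketch below streams such a file and tallies which top-level keys occur, which is a reasonable first look at schema-free records. The function names are illustrative assumptions, and the file name in the usage note is simply the one from the `mongoimport` command above.

```python
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a one-record-per-line JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"line {line_number}: {error}") from error


def key_frequencies(path):
    """Count how often each top-level key appears across all records."""
    counts = Counter()
    for record in iter_records(path):
        counts.update(record.keys())
    return counts
```

Calling `key_frequencies("web.analysis.json").most_common(20)` would list the twenty most frequent keys. Records are read one at a time, so memory use stays small even for the full set.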
(sha1:1b6433440b60113d01fa0a93eb9b5d6f35706b45).
(sha1:160158fdc209a412770b6bc47c7d1e62c592d981) contains the SHA-1 signature for the deduped files.
The source code to Fuse is available on GitHub at Spreadsheet-Common-Crawler. Our source code is published under a BSD license.
If you really need to, you can take the deduped binary spreadsheets and expand them back into the full set. The first column of the fuse-dedup.map-dec2014.txt file (51 MB, sha1:96ecb17f933d132b1b5fe59dee75180a6d6ad91e) contains the WARC-Record-ID of the original spreadsheet. The second column contains the equivalent deduped WARC-Record-ID. Thus, for each record, one must simply copy the deduped file to the expanded file. There are any number of ways to do this:
cat cc.dedup.map.txt | \
  awk '{print "src/" $2 "\n" "dst/" $1 "\n"}' | \
  xargs -n 2 cp
The result of this process is 719,223 binaries from the original 249,376 binaries.
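The same expansion can be sketched in Python. This version assumes, as the shell command above does, that the map file has whitespace-separated columns and that each deduped binary is stored in the source directory under its WARC-Record-ID. The function name and directory arguments are illustrative assumptions.

```python
import shutil
from pathlib import Path


def expand_deduped(map_path, src_dir, dst_dir):
    """Copy each deduped binary to every original WARC-Record-ID that maps to it.

    Column 1 of the map file is the original ID, column 2 the deduped ID.
    Returns the number of files copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    with open(map_path, "r", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            original_id, deduped_id = fields[0], fields[1]
            shutil.copyfile(src_dir / deduped_id, dst_dir / original_id)
            copied += 1
    return copied
```

A call such as `expand_deduped("cc.dedup.map.txt", "src", "dst")` mirrors the directory layout used in the shell command. Unlike `xargs`, it raises an exception on the first missing source file, which makes an incomplete download easier to spot.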
We used the following parameters for EMR:
-m, mapred.map.tasks.speculative.execution=false, -m, mapred.reduce.tasks=0, -c, fs.s3n.ssl.enabled=false, -m, mapreduce.map.java.opts=-Xmx4096m, -m, mapreduce.map.memory.mb=4096, -m, io.file.buffer.size=65536, -m, mapreduce.task.timeout=1200000, -y, yarn.scheduler.maximum-allocation-mb=4096, -y, yarn.nodemanager.resource.cpu-vcores=1, -y, yarn.scheduler.minimum-allocation-mb=4096
-m, mapred.map.tasks.speculative.execution=false, -m, mapred.reduce.tasks=0, -c, fs.s3n.ssl.enabled=false, -m, io.file.buffer.size=65536, -m, mapreduce.task.timeout=1200000
The important thing is to disable speculative execution, since our map functions are not idempotent: they have side effects because they write to S3.
Reduce is inapplicable to our pipeline, which is why mapred.reduce.tasks is set to 0.
Theoretically, the number of instances can scale infinitely, since our technique is embarrassingly parallel. In practice, S3 rate limits are the critical bottleneck. For more information, see Request Rate and Performance Considerations. S3 requests must also be fault-tolerant, as they are not guaranteed to be successful. Finally, S3 uses an eventual
WARC-Record-ID: 000021ae-58b0-45de-9c1d-92a1d35f07df. You can obtain this record with the following MongoDB query:
db.sdedup.find({'WARC-Record-ID': '<urn:uuid:000021ae-58b0-45de-9c1d-92a1d35f07df>'})
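Note that the query wraps the bare UUID in the `<urn:uuid:...>` form. The sketch below builds that filter from a bare UUID and, as an alternative to MongoDB, scans a one-record-per-line JSON file for the matching record. It assumes the records store the ID in the same wrapped form as the query. The function names and the file path are illustrative assumptions.

```python
import json


def record_id_filter(uuid_text):
    """Build the filter document from a bare UUID or an already wrapped ID."""
    value = uuid_text.strip()
    if not (value.startswith("<urn:uuid:") and value.endswith(">")):
        value = f"<urn:uuid:{value}>"
    return {"WARC-Record-ID": value}


def find_record(path, uuid_text):
    """Return the first record in a JSON-lines file with the given ID, or None."""
    wanted = record_id_filter(uuid_text)["WARC-Record-ID"]
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("WARC-Record-ID") == wanted:
                return record
    return None
```

For instance, `record_id_filter("000021ae-58b0-45de-9c1d-92a1d35f07df")` produces the same filter document that is passed to `find` in the query above.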