Sloan Digital Sky Survey

Sloan Digital Sky Survey Data Distribution Plan

1. The SDSS Data Release

Data from the SDSS vary in complexity and size. The full volume of data from the project is beyond the capability of most single users or institutions to store or use effectively. Many studies can be done with positions and redshifts of "only" one million galaxies. When the survey is concluded, this catalog will be easy to distribute, perhaps as an attachment to an email message. Many projects do not require data from the complete SDSS area and hence can be accomplished as the survey progresses, by using data that are processed and calibrated during the survey. Much research requires significantly more information about each object than the simple redshift catalog provides, such as the measured photometry of objects, their morphology, pixel maps of detected objects, links to repeated measurements of objects, and corrected image frames. Clearly, the data distribution plan for SDSS data must take these aspects into consideration: the availability of reliable data sets during the survey, and the completeness of the information balanced against the resulting data volume.

In this chapter we present a milestone chart (Figure 1) with specific dates and percentages of data released. Note that the longest latency time is 1.5 years, decreasing to 1 year by the end of the survey. We distinguish "essential" and "complete" access to our data products. The release of essential components is supported with the current resources. We provide an outline of a plan to raise funds for on-line distribution services at Fermilab for complete access.

Natural Timescales – Points of No Return
The survey operations already impose two well-defined "points of no return" on the data processing. The first occurs when the imaging data are determined to be good enough that target selection can be done. The second occurs when the spectroscopic reductions are determined to be good enough that a particular "tile" on the sky can be declared done. The first event is a particularly hard boundary: once we drill plates with hole positions fixed for individual targets and take spectra, we will not want to reprocess the imaging data, else the selection effects that go into the spectroscopic sample will be lost. The penalty for reprocessing and/or recalibrating would potentially include throwing out some spectroscopic data as well.

Thus there are two opposing requirements on the processing of the imaging data: one is to make sure the data are the best obtainable and have the best calibrations, while the other is to reduce the data fast enough to be able to drill plates. Commitments on the timing of the data distribution will be referenced to the two events described above, without any commitment to when they should occur with respect to the time that the data are taken. (In this regard, we are like the game of baseball, where time is measured by innings, not by the clock on the wall.) We propose that for the public data distribution the latency time be measured from the "points of no return". The data that are tied to these natural timescales form our statistical sample and consist of the principal data products.
Quantized data release
In order to provide a meaningful versioning of the archive, we propose to release the data in yearly quanta. The complexity of the system and the expected repeated verifications of the calibration require that initially we release the statistical sample with an approximately 1.5 year delay. We expect that this policy, similar to the one adopted by COBE, will be necessary for the first two years. Then we gradually decrease the delay, so that by year five of the survey the latency time will be one year.

Figure 1. Milestones for the SDSS Data Release
Detailed Milestones for the Data Release
Figure 1 shows the precise milestones, their dates and the data fractions included. We show these for the two main data components, the photometric catalog and the spectroscopic sample. We defined several milestones, shown by the triangles on the figure. These are (i) the beginning of the survey and (ii) the points when we "quantize" the yearly data sets for the subsequent public release. These points were chosen to be at mid-year, July 1, since the survey's primary focus is the North Galactic Cap. We assume that the survey will begin in January 2000, as currently planned. We also assume that in the first two quarters of every year we observe for 6 months, the third quarter is lost to weather, and we can access the survey area for 1 month in the fourth quarter. The accumulation of the raw photometric data (shown in yellow) begins in January 2000 and ends in summer 2004. The calibration of the photometry will take approximately six months in the beginning, shrinking gradually to three months by the end (blue line on the chart). Spectroscopy will begin in 2001 and will last until the end of 2004, shown with a red, dashed line. These two lines show the fraction of data available to us at any point in time.

We propose that the data that have reached the "points of no return" be quantized at the mid-year milestones, and released at the times shown by the tips of the arrows. The first two releases will follow their milestones by 1.5 years and the next by 1.25 years; after that we move to a 1-year latency. The vertical positions of the arrows indicate the total percentage of the data in the public domain at that point. This latency will still give us sufficient time to completely revise our calibrations should a serious deficiency be discovered during the first half of the survey.
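The latency rule above can be turned into dates with a short calculation. The following is a minimal sketch, not part of the plan itself: it assumes that the first quantization milestone falls on July 1, 2000, that there are five yearly milestones, and that the last two use the 1-year latency. The function names are hypothetical, and Figure 1 remains the authoritative schedule.

```python
from datetime import date

# Latencies in months after each mid-year milestone, as described in the text:
# two releases at 1.5 years, one at 1.25 years, then 1 year.
# Assumption: five milestones in total, the last two at the 1-year latency.
LATENCY_MONTHS = [18, 18, 15, 12, 12]


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date forward by a whole number of months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


def release_schedule(first_milestone_year: int = 2000):
    """Pair each July 1 milestone with its nominal public release date."""
    schedule = []
    for i, months in enumerate(LATENCY_MONTHS):
        milestone = date(first_milestone_year + i, 7, 1)
        schedule.append((milestone, add_months(milestone, months)))
    return schedule


for milestone, release in release_schedule():
    print(milestone.isoformat(), "->", release.isoformat())
```

Under these assumptions the data quantized on July 1, 2000 would be released on January 1, 2002, and the data quantized on July 1, 2002 on October 1, 2003.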

Details of the Data Products

We list the SDSS data products in Table 1, including their sizes at the end of the survey and the expected method for their distribution. The resources required to distribute these data products are discussed in the next section.

Table 1. Data Products

Product	Size	Form
1. Complete Redshift Catalog	2 GB	CD-ROM
2. Compact Photometric Catalog	60 GB	CD-ROM
3. Survey Description (Status, Calibrations)	1 GB	CD-ROM
4. Full photometric catalog	400 GB	On-line
5. Atlas Images	1.5 TB	On-line
6. Compressed Sky Map	300 GB	On-line
7. 1D Spectra	60 GB	On-line
8. Calibrations	5 GB	On-line
9. Quality Assurance Information	5 GB	On-line
10. Corrected Imaging Frames	15 TB	Tape Robot
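
To give a sense of scale for each distribution method, the sizes in Table 1 can be summed per form. This is an illustrative sketch only: the sizes are copied from the table with 1 TB taken as 1000 GB, and the disc capacity of 650 MB is our assumption, since the plan does not state one.

```python
import math

# (product, size in GB, distribution form), copied from Table 1.
# 1 TB is taken as 1000 GB for this estimate.
PRODUCTS = [
    ("Complete Redshift Catalog", 2, "CD-ROM"),
    ("Compact Photometric Catalog", 60, "CD-ROM"),
    ("Survey Description (Status, Calibrations)", 1, "CD-ROM"),
    ("Full photometric catalog", 400, "On-line"),
    ("Atlas Images", 1500, "On-line"),
    ("Compressed Sky Map", 300, "On-line"),
    ("1D Spectra", 60, "On-line"),
    ("Calibrations", 5, "On-line"),
    ("Quality Assurance Information", 5, "On-line"),
    ("Corrected Imaging Frames", 15000, "Tape Robot"),
]

# Assumption: a 650 MB disc; the plan does not specify a capacity.
CD_CAPACITY_GB = 0.65


def totals_by_form(products):
    """Sum the end-of-survey sizes (in GB) for each distribution form."""
    totals = {}
    for _name, size_gb, form in products:
        totals[form] = totals.get(form, 0) + size_gb
    return totals


totals = totals_by_form(PRODUCTS)
discs = math.ceil(totals["CD-ROM"] / CD_CAPACITY_GB)

for form, size_gb in totals.items():
    print(f"{form}: {size_gb} GB")
print(f"CD-ROM products would fill about {discs} discs of 650 MB each")
```

With these figures the CD-ROM products total 63 GB, the on-line products 2270 GB, and the tape-robot product 15000 GB.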

Complete Redshift Catalog
The objects include galaxies, quasars and a selection of other sources, including stars of various properties, ROSAT and FIRST sources. The catalog will also contain all relevant photometric information, as defined in the Compact Photometric Catalog. This will be available shortly after the spectra are reduced and processed, and will be released according to the milestone chart.
Compact Photometric Catalog
This product contains most of the scientifically useful photometric information, in a particularly compact form, to facilitate easy distribution. It contains all objects, but the number of attributes is kept to a minimum (id, position, magnitudes, colors, size, ellipticity, position angle, errors, classification, flags), a total of 400 bytes/object. It is compressed (magnitudes are multiplied by 1000 and stored as 2-byte integers), and contains only one (the primary) observation for each object even if there are multiple epoch detections. This can be generated shortly after the point of no return, at approximately 6 months after observation, decreasing to 3 months as the survey evolves. This data set is released on CDs, according to the schedule on the milestone chart.
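The magnitude compression mentioned above (multiply by 1000, store as a 2-byte integer) can be sketched as follows. This is a minimal illustration, not the actual catalog format: the choice of signed, little-endian integers and of rounding to the nearest integer are assumptions, and the function names are hypothetical.

```python
import struct

SCALE = 1000  # magnitudes are multiplied by 1000 before storage


def pack_magnitudes(mags):
    """Encode magnitudes as signed 2-byte integers.

    Assumptions: little-endian byte order and rounding to the nearest
    integer. Signed 2-byte storage limits values to +/-32.767 mag;
    struct.pack raises struct.error for anything outside that range.
    """
    codes = [round(m * SCALE) for m in mags]
    return struct.pack(f"<{len(codes)}h", *codes)


def unpack_magnitudes(buf):
    """Decode a buffer written by pack_magnitudes back to magnitudes."""
    count = len(buf) // 2
    return [c / SCALE for c in struct.unpack(f"<{count}h", buf)]


# Five hypothetical magnitudes, one per band.
mags = [22.315, 21.02, 20.5, 19.987, 19.4]
buf = pack_magnitudes(mags)
print(len(buf), "bytes for", len(mags), "magnitudes")
print(unpack_magnitudes(buf))
```

Stored this way, each magnitude occupies 2 bytes and keeps a resolution of 0.001 mag.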
Survey Description and Status
The description of the survey is not a product, but rather documentation. It should be available on-line, as well as on CD-ROM. Most of this is already contained on the SDSS Sampler #1 CD-ROM, available at https://web.archive.org/web/20131211064852/http://www.sdss.org/cdrom1/index.htm. The status of the survey is the ensemble of many individual data products. We will provide these on-line with essentially no delays. These include the stripes/strips observed, fraction of raw photometric data collected to date, fraction reduced, fraction calibrated, fraction targeted, number of spectroscopic plates observed, fraction of spectra reduced, fraction of redshifts obtained, status of the instruments, weather logs, instrument logs.
Full Photometric Catalog
Parameters including positions, magnitudes, radial profiles and shape parameters for 200 million objects in 5 bands to the detection limit of the survey. The biggest difference between the compact and full catalogs is that the full catalog contains several different kinds of magnitudes, 12 radial profiles in logarithmic bins and their errors, survey coordinates, pixel coordinates, detailed calibration parameters and their versions, and various instrumental records. Also, all observations of the objects are stored here, not just the primaries. Additionally, mask files define in detail which sections of an image frame are not processed for various reasons. This catalog and the mask files will be placed on-line per the milestone chart.
Atlas Images
Cut-outs of the images of detected objects from the full image frames in 5 colors, 1 billion images in total. These are ready after the final photometric processing of the data. Due to their size they will be available online, and accessible by their locations, organized into fields. They will be put online per the milestone chart.
Compressed Sky Map
A 4x4 compressed version of the image frames after removing objects. This map, along with the atlas images, can be used to approximately reconstruct the original image frames. This data set will only be created at the very end of the survey, when the whole area is contiguously covered. We will make it available shortly afterwards.
1-D Spectra
These are extracted from the 2-D images, and the blue and red halves have been merged together. They are created during the spectroscopic reduction process, and will be ready shortly after. They will be released on-line, in sync with the redshift catalog, per the milestone chart.
Calibrations
These astrometric (position) and photometric (flux) calibration coefficients for the full image frames will be available on-line, soon after data are processed. These will be versioned so that the actual calibrations used for target selection will be available as additional work produces better calibrations.
Quality Assurance Information
These are the summaries from the data processing pipelines which quantify the observing conditions.
Corrected Imaging Frames
The flat-fielded, corrected imaging frames, which were used for object detection. We do not envisage distributing these widely, even within the SDSS collaboration, but they are stored at Fermilab in a tape vault, for legacy purposes.

Resources for the Public Archive

The most essential data products will be made available on CD-ROM. According to the timetable described above, we will make master copies of these data sets available to a third party for reproduction and distribution. Although we have not initiated any discussion with CD-ROM publishers, the Astronomical Society of the Pacific provided this service for "The Digitized Sky Survey" produced by the Space Telescope Science Institute (AURA).

It is not practical to distribute the larger or more specialized data products on CDs, but it is the intention of the project to maintain an online public WWW site to support some of the larger data sets, at least for the duration of the project. In addition, depending on the resources required, we may provide additional online access to:

Corrected imaging and spectroscopic frames. We have all the corrected frames archived at FNAL in a tape robot. A small number of frames can be retrieved from this machine with nominal delay.
Object-oriented science database of the object catalog. The database is structured to optimize queries for scientific research. The various data sets are linked, so a query on imaging properties could return the atlas images, 1-D spectra, etc.

The financial support of this effort will require additional funds and may involve costs to the users. The details of such a plan, including the products and timescales involved, will depend on the success of our fundraising effort and on a better grasp of the necessary activities.

As a first step, Fermilab as an MOU partner in the SDSS will request that the Department of Energy allow Fermilab to undertake the distribution of the data products described in the data release plan to the scientific community with the condition that, exclusive of the SDSS collaboration, the distribution costs will be borne by the requesters. Thus the requesters will pay for material such as, but not limited to, CDs, tapes, manuals, and any special Internet charges. Furthermore, Fermilab will request that the Department of Energy (DOE) allow Fermilab to acquire the capital equipment needed to distribute the data products to the scientific community using DOE funds placed in the Fermilab/URA financial plan. Fermilab will propose to DOE that Fermilab be permitted to maintain this equipment at Fermilab with DOE funds and to provide a reasonable level of user support beyond the direct data distribution. This additional support will not exceed one FTE. The participating SDSS institutions have agreed to provide their local user support at their own expense.

Should the DOE agree to this proposal, Fermilab will develop a set of procedures for the scientific community to use when requesting data products. In the event a request would require resources beyond what is envisaged above, a requester would be required to submit a formal proposal to Fermilab, which would be treated as all other proposals to Fermilab for the use of its facilities.

In the event that the DOE does not agree to support this data distribution plan, the SDSS will enter into discussions with NASA to determine whether it will support such a plan in its entirety or in part.

Finally, the SDSS project is in the process of making a comprehensive plan for the distribution of the data not only to the astronomical community but also to the public (schools, libraries, etc.). This distribution will be accompanied by powerful database tools that will greatly enhance the accessibility of the data and the analyses that can be done. The ability to explore enormous data sets efficiently will allow queries to be sharpened and new questions to be formulated and addressed. In this spirit, the SDSS is not merely a research project for the participating scientists, but a globally accessible and completely unprecedented scientific resource. It will enable research on new astrophysical problems to be undertaken by anyone at any location.