
Public Distribution of the SDSS Data
Alex Szalay


Introduction
Data from the SDSS vary in complexity and size. The full volume of data from the project is beyond the capacity of most individual users or institutions to store or use effectively. Many studies can be done with positions and redshifts of “only” one million galaxies. When the survey is complete, this catalog will be easy to distribute, perhaps as an attachment to an email message. Many projects do not require data from the complete SDSS area and hence can be accomplished as the survey progresses, by using data that are processed and calibrated during the survey. Much research requires significantly more information about each object than the simple redshift catalog provides, such as the measured photometry of objects and corrected image frames. Clearly, the data distribution plan for the SDSS data must take these aspects into consideration: the availability of reliable data sets during the survey, and the completeness of the information balanced against the resulting data volume.

In this section we present a detailed milestone chart (Figure 1) with specific dates and percentages of data released. Note that the longest latency time is 1.5 years, decreasing to 1 year by the end of the survey. We distinguish “essential” and “complete” access to our data products. The release of essential components is supported with the current resources. We provide an outline for on-line distribution at Fermilab for complete access.


Natural Timescales - Points of No Return
The survey operations already impose two well-defined “points of no return” on the data processing. The first occurs when the imaging data are determined to be good enough that target selection can be done. The second occurs when the spectroscopic reductions are determined to be good enough that a particular “tile” on the sky can be declared done. The first event is a particularly hard boundary: once we drill plates with hole positions fixed for individual targets and take spectra, we will not want to reprocess the imaging data, else the selection effects that go into the spectroscopic sample will be lost. The penalty for reprocessing and/or recalibrating would potentially include throwing out some spectroscopic data as well.

Thus there are two opposing requirements on the processing of the imaging data: one is to make sure the data are the best obtainable and have the best calibrations, while the other is to reduce the data fast enough to be able to drill plates. Commitments on the timing of the data distribution will be referenced to the two events described above, without any commitment to when they should occur with respect to the time that the data are taken. (In this regard, we are like the game of baseball, where time is measured by innings, not by the clock on the wall.) We propose that for the public data distribution the latency time is measured from the “points of no return.” The data, which are tied to these natural timescales, form our statistical sample, and consist of the principal data products.


Quantized Data Release
In order to provide a meaningful versioning of the archive, we propose to release the data in yearly quanta. The complexity of the system and the expected repeated verifications of the calibration require that initially we release the statistical sample with a two-year delay. We expect that this policy, similar to the one adopted by COBE, will be necessary for the first two years. Then we gradually decrease the delay, so that by year 5 of the survey the latency time will be 1 year. In order to demonstrate our willingness to release data, we will do an early data release, in early 2001, which will contain the higher quality subset of the data taken before April 1, 2000.


Detailed Milestones for the Data Release
Figure 1 shows the precise milestones, their dates and the data fraction included. We show these for the two main data components, the photometric catalog and the spectroscopic sample. We defined several milestones, shown by the triangles on the figure. These are (i) the beginning of the survey, and (ii) the points when we “quantize” the yearly data sets for the subsequent public release. These points were chosen to be at mid-year, July 1, since the survey’s primary focus is the North Galactic Cap. We assume that the survey will begin in January 2000. We also assume that in the first two quarters of every year we observe for 6 months, that the third quarter is lost to weather, and that we can access the survey area for 1 month in the fourth quarter. The accumulation of raw photometric data (shown in yellow) begins in January 2000 and ends in summer 2004. The calibration of the photometry will take approximately six months at the beginning, shrinking gradually to three months by the end (blue line on the chart). Spectroscopy will begin in 2001 and will last until the end of 2004 (red dashed line on the chart). These two lines show the fraction of data available to us at any point in time.

We propose that the data that have reached the “points of no return” be quantized at the mid-year milestones and released at the times shown by the tips of the arrows. The first releases (except for the early release) will follow their milestones by 1.5 years, the next by 1.25 years, and after that we move to a 1-year latency. The vertical positions of the arrows indicate the total percentage of the data in the public domain at that point. This latency will still give us sufficient time to completely revise our calibrations should a serious deficiency be discovered during the first half of the survey.
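The latency arithmetic above can be sketched in a few lines. This is only an illustration: the helper name `release_date` and the choice of July 1, 2001 as the example milestone are assumptions made here, and the actual assignment of latencies to individual milestones is the one shown in Figure 1.

```python
from datetime import date


def release_date(milestone: date, latency_years: float) -> date:
    """Shift a milestone by a latency given in years.

    Assumes milestones fall on the first day of a month and that
    latencies are whole numbers of months (1.5, 1.25 and 1 year all are).
    """
    months = round(latency_years * 12)
    total = milestone.year * 12 + (milestone.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


# Hypothetical example: one mid-year milestone with each of the three latencies.
example_milestone = date(2001, 7, 1)
for latency in (1.5, 1.25, 1.0):
    print(latency, release_date(example_milestone, latency))
```

With a July 1 milestone, a 1.5-year latency puts the release on January 1, a 1.25-year latency on October 1, and a 1-year latency on the following July 1.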



Figure 1. Milestones and data fractions for the SDSS public data release


Details of the Data Products
We list the SDSS data products in the following table, including their sizes at the end of the survey and the expected method for their distribution. The resources required to distribute these data products are discussed in the next section.


Product                                        Size      Form
1. Complete Redshift Catalog                   2 GB      CD-ROM, ftp
2. Compact Photometric Catalog                 60 GB     CD-ROM, ftp
3. Survey Description (Status, Calibration)    1 GB      CD-ROM, www
   Full Photometric Catalog                    400 GB    On-line, SX
   Atlas Images                                1.5 TB    On-line, SX
   Compressed Sky Map                          300 GB    On-line, ftp
   1D Spectra                                  60 GB     On-line, SX
   Calibrations                                5 GB      On-line, SX, ftp
   Corrected Imaging Frames                    15 TB     On-line, ftp


1. Complete Redshift Catalog
The objects include galaxies, quasars and a selection of other sources, including stars of various properties, ROSAT and FIRST sources. The catalog will also contain all relevant photometric information, as defined in the Compact Photometric Catalog. This will be available shortly after the spectra are reduced and processed. It will be released according to the milestone chart.
2. Compact Photometric Catalog
This product contains most of the scientifically useful photometric information, in a particularly compact form, to facilitate easy distribution. It contains all objects, but the number of attributes is kept at a minimum (id, position, magnitudes, colors, size, ellipticity, position angle, errors, classification, flags), a total of about 400 bytes/object. It is compressed (magnitudes are multiplied by 1000, and stored as 2-byte integers), and contains only one (the primary) observation for each object even if there are multiple epoch detections. This can be generated shortly after the point of no return, at approximately 6 months after observation, decreasing to 3 months as the survey evolves. This data set is released on CD-ROM/DVD/ftp, according to the schedule on the milestone chart.
3. Survey Description and Status
The description of the survey is not a product, but rather documentation. It should be available on-line, as well as on CD-ROM. Most of this is already contained on the SDSS Sampler#1 CD-ROM, also available at https://web.archive.org/web/20131211064852/http://www.sdss.org/cdrom1/index.htm. The status of the survey is the ensemble of many individual data products. We will provide these on-line with essentially no delays, accessible both via ftp and the SDSS web site. These include the stripes/strips observed, fraction of raw photometric data collected to-date, fraction reduced, fraction calibrated, fraction targeted, number of spectroscopic plates observed, fraction of spectra reduced, fraction of redshifts obtained, status of the instruments, weather logs, instrument logs.
4. Full Photometric Catalog
This catalog contains parameters including positions, magnitudes, radial profiles and shape parameters for 200 million objects in 5 bands to the detection limit of the survey. The biggest difference between the compact and full catalogs is that the full catalog contains several different kinds of magnitudes, 12 radial profiles in logarithmic bins and their errors, survey coordinates, pixel coordinates, detailed calibration parameters and their versions, and various instrumental records. Also, all observations of the objects are stored here, not just the primaries. Additionally, mask files define in detail which sections of an image frame are not processed for various reasons. This catalog and the mask files will be placed on-line in a searchable database per the milestone chart.
5. Atlas Images
Cut-outs of the images of detected objects from the full image frames in 5 colors, 1 billion images in total. These are ready after the final photometric processing of the data. Due to their size they will be available on-line and accessible by their locations, organized into fields. They will be put on-line per the milestone chart.
6. Compressed Sky Map
A 4x4 compressed version of the image frames after removing objects. This map, along with the atlas images, can be used to approximately reconstruct the original image frames. This data set will only be created at the very end of the survey, when the whole area is contiguously covered. We will make it available shortly after.
7. 1-D Spectra
These are extracted from the 2-D images, and the blue and red halves have been merged together. They are created during the spectroscopic reduction process, and will be ready shortly after. They will be released on-line, in sync with the redshift catalog, per the milestone chart (ftp/www).
8. Calibrations
The astrometric (position) and photometric (flux) calibration coefficients for the full image frames will be available on-line, soon after the data are processed. These will be versioned, so that the actual calibrations used for target selection will still be available even as additional work produces better calibrations.
9. Quality Assurance Information
These are summaries from the data processing pipelines which quantify the observing conditions.
10. Corrected Imaging Frames
The flat-fielded, corrected imaging frames, which were used for object detection. We have not envisaged distributing these widely, even for the SDSS collaboration. They are currently stored at Fermilab, in a tape vault for legacy purposes. The recent sharp drop in the price/GB for hard disks may enable us to keep these on-line, should further resources become available.

All other data products are of lower priority for public access.
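The magnitude compression described for the Compact Photometric Catalog (magnitudes multiplied by 1000 and stored as 2-byte integers) could look roughly like the sketch below. The signed, little-endian layout and the function names `encode_mag` and `decode_mag` are assumptions made for illustration; the text does not specify the actual on-disk format.

```python
import struct

SCALE = 1000  # magnitudes are multiplied by 1000, i.e. stored in millimagnitudes


def encode_mag(mag: float) -> bytes:
    """Pack a magnitude into 2 bytes (assumed: signed, little-endian)."""
    value = round(mag * SCALE)
    if not -32768 <= value <= 32767:
        raise ValueError(f"magnitude {mag} does not fit in a 2-byte integer")
    return struct.pack("<h", value)


def decode_mag(raw: bytes) -> float:
    """Recover the magnitude, to 0.001 mag precision, from its 2-byte form."""
    return struct.unpack("<h", raw)[0] / SCALE


packed = encode_mag(17.234)
print(len(packed), decode_mag(packed))
```

Under these assumptions a signed 2-byte integer covers magnitudes from -32.768 to 32.767 at a resolution of 0.001 mag, and each stored magnitude takes 2 bytes instead of the 4 or 8 bytes of a floating-point value.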



Details of the Early Data Release
The early data release will contain a 600 square degree area around the equator, consisting of runs 752/756 and 94/125, plus a small selected area of about 2 square degrees. These are essentially all the data taken before April 1, 2000. We propose a two-phase release of these data.

Phase 1 will consist of a web-based interface, containing a database of GIF images of the data, with clickable access to the catalog information, and a simple search engine for the full photometric catalog. At the same time we will also provide on-line ftp access to the Compact Photometric Catalog and Calibration information, and to Status Information via the SDSS web site. These services will be built in collaboration with Jim Gray (Microsoft), and the necessary hardware will be provided by a grant from Microsoft Research. The ftp site will also contain all SDSS data (corrected frames, and all the internal files) for a 2 square degree selected area of the sky, with the necessary documentation.

Phase 2 will consist of the same data accessible through the high-performance search engine (the SDSS Science Archive, SX) built for the survey. This will enable much larger and more sophisticated queries. Also, the Atlas Image Server will be opened up for public use. The Atlas Image Server will not only supply the individual atlas images, but will also be able to reconstruct a corrected frame in FITS format from the atlas images. These services will need to be accessed by username and password, to make the support more manageable. These accounts will be provided in consultation with the funding agencies (Sloan, NASA, NSF, DOE). As the resource requirements for these services become better understood, we may open up the access policies. The resources for these services will be included in our proposal to the NSF. Should other resources become available, we may also be able to place the corrected frames for this region of the sky on the ftp/www server.




Figure 2. Services of the SDSS public data release


Resources needed for the Early Data Release (Year 1)
Here we list the additional resources required for the purpose of the public data release, and do not include resources already considered for the internal data distribution.

Resource            www     chart   ftp     sx      atlas   total
development [mo]    0.5     1       1       1       1       4.5
loading [mo]        0       0.5     0.5     0.5     0.5     2
hardware            0       50      15      50      50      165
software            0       SQL     0       Objy    Objy    -
support [FTE]       0.45    0.05    0.05    0.4     0.05    1

For the web services we need a 0.5-month effort to update the pages with links to the different services, update the status pages, etc. We will also need a steady, ongoing effort to keep the pages in sync and up to date; this requires about 0.45 FTE. The chart, ftp, SX and atlas services will each require about 1 month of development effort to make sure that there is adequate documentation and to transfer information from the internal document pages to a public document server. The data loading is similar, about 0.5 month each; it consists of transferring the relevant data from the project domain to the public servers. The hardware requirements are reasonably modest. For Phase 1, the hardware for the Finding Chart has been committed by Jim Gray of Microsoft Research. The necessary software licences (SQL Server) will be provided for free. For the ftp server only a modest amount of disk space is needed, and possibly a separate CPU. For SX and the Atlas Image service we need a combination of CPUs and disks. The SX configuration is more CPU and I/O intensive, while we need more disks for the Atlas Image service. Ongoing support is not a big issue for most of the services, since it involves just checking and recovery of lost data. The exception is SX, for which we need an almost half-time person to provide support with advanced queries and data-related questions.
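As a consistency check, the per-service figures in the Year 1 table can be summed and compared with its "total" column. The sketch below simply transcribes the numeric rows of the table; the dictionary keys and the helper `row_totals` are names chosen here for illustration.

```python
SERVICES = ("www", "chart", "ftp", "sx", "atlas")

# Numeric rows of the Year 1 table, one value per service in SERVICES order.
YEAR1 = {
    "development_months": (0.5, 1, 1, 1, 1),
    "loading_months": (0, 0.5, 0.5, 0.5, 0.5),
    "hardware": (0, 50, 15, 50, 50),
    "support_fte": (0.45, 0.05, 0.05, 0.4, 0.05),
}


def row_totals(table):
    """Sum each row across services, rounding away floating-point noise."""
    return {name: round(sum(values), 2) for name, values in table.items()}


for name, total in row_totals(YEAR1).items():
    print(f"{name}: {total}")
```

The sums come out to 4.5 months of development, 2 months of loading, 165 for hardware and 1 FTE of support, matching the totals quoted in the table.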



Resources Needed in Years 2-4 (per year)

Resource            www     chart   ftp     sx      atlas   total
loading [FTE]       0.1     0.1     0.1     0.1     0.1     0.5
support [FTE]       0.5     0.1     0.1     1       0.1     1.8
HW/SW [$K]          5       15      15      35      30      100

In the ongoing part of the survey, we will need a steady 0.1 FTE per service to continue loading the data from the internal data systems to the public ones. Support requirements are 0.5 FTE for the www service, 1 FTE for SX, and 0.1 FTE for each of the rest. The combined hardware/software requirements are anticipated to be around $100K/year. This figure is based on 60% of the first-year expenditure. Staying with a level budget will enable us to improve the hardware, since the price per unit of performance is expected to drop at an exponential rate.
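The yearly figures can be checked the same way. The sketch below transcribes the Years 2-4 table and compares the hardware/software total with the Year 1 hardware total of 165; treating that total as the "first year expenditure" behind the 60% figure is an assumption made here, as are the variable names.

```python
# Rows of the Years 2-4 table, one value per service (www, chart, ftp, sx, atlas).
YEARS_2_TO_4 = {
    "loading_fte": (0.1, 0.1, 0.1, 0.1, 0.1),
    "support_fte": (0.5, 0.1, 0.1, 1, 0.1),
    "hw_sw_kusd": (5, 15, 15, 35, 30),
}

# Assumption: the Year 1 hardware total (165) is the "first year expenditure".
YEAR1_HARDWARE_TOTAL = 165

yearly_totals = {name: round(sum(values), 2) for name, values in YEARS_2_TO_4.items()}
ratio = round(yearly_totals["hw_sw_kusd"] / YEAR1_HARDWARE_TOTAL, 3)

print(yearly_totals)
print(f"HW/SW per year as a fraction of Year 1 hardware: {ratio}")
```

Under that assumption the ratio is 100/165, or about 0.61, consistent with the quoted 60%.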