Skip to content

Number Density Modelling Algorithm for DESI Target Selection

Notifications You must be signed in to change notification settings

tanveerkarim/NDM-paper-scripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Author: Jae Hyeon Lee
Last updated: February 12, 2018
E-mail: jaekor91@gmail.com

This README documents scripts used to produce results reported in the NDM selection method paper (2018). It also details scripts and their products not reported in the paper but documented here for reproducibility and complete update on the status of the method.

Warning: I emphasize that the primary purpose of this documentation is for reproducibility rather than the ease of their adaptation to other projects. Hence I have not made a substantial effort to maintain the codes at a professional level beyond general readability and cleanliness.

README convention (for the author):
- "/----" to denote the beginning of a new grouping or section. 
- Five "\n" between sections.
- Use only "-" for lists.
- "!" means the particular analysis is not performed to completion.

Contents: 
- Organization
- Software requirements
- utils and NDM_models modules
- Data download and acquisition
- Data pre-processing
- Data quality checks
- MMT observational study scripts/data archive
- Application
- List of challenges




/---- Organization
Here I provide a high level overview of the documentation as a reminder for myself. On the author's local repository, the current directory "/scripts/" is embeded under "/NDM-paper/" directory as below along with other sister directories.
/NDM-paper
	/article
	/data
		/derived
		/DEEP2		
		/DR5		
		/MMT_data	
		/spitzer		
		tycho2.fits	
	/scripts
	/figures

- "article" contains the manuscript .tex file and all other relevant packages, figures, bibliography, etc. 
- "data" contains all primary data files used for the project. Any derived data products such as cross-matched catalogs are stored in "/data/derived/" directory. 
- "figures" contains all graphic products. Unless mentioned otherwise, all figures saved on the flat directory level.
- "scripts" contains all Python scripts used to produce any results.






/---- Software requirements
Unless mentioned otherwise, the work contained here is based on Python 2.7.12. Below is a non-exhaustive list of key packages used: 
- numpy v1.9.1
- astropy v1.1.1
- scipy v0.19.0
- extreme-deconvolution-1.4 (https://github.com/jobovy/extreme-deconvolution)

The rest is based on Python 3.5.2. In particular, Random Forest (RF) module from scikit-learn v0.19.1 was applied to DEEP2-DECaLS DR5 data. 

These are not stringent requirements, and there is no reason to believe that other compatible versions of Python would produce different results.





/---- utils and NDM_models
All author defined functions are collected in utils and NDM_models. The latter is a collection of classes and methods that are specific to the NDM method whereas the former provides general convenience functions. For convenience, I employ the bad practice of importing all external modules, functions, and classes at the same time via "from utils, NDM_models import *". 





/---- Data download and acquisition
In this section, I document how key data for the project were obtained. All of these data are available via (*provide URL*) or can be obtained from the author in a .zip file via e-mail. 

- DEEP2 photometric catalogs: Downloaded from the DEEP2 survey website using below commands.
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.21.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.22.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.31.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.32.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.33.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.41.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.42.fits.gz
wget http://deep.ps.uci.edu/DR4/data/pcat_ext.43.fits.gz

- DEEP2 window functions: Downloaded from the DEEP2 survey website using below commands.
wget http://deep.ps.uci.edu/DR4/data/windowf.21.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.22.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.31.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.32.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.33.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.41.fits.gz
wget http://deep.ps.uci.edu/DR4/data/windowf.42.fits.gz

- DEEP2 redshift catalogs: These were obtained from John Moustakas and are described in a Tech Report (DESI docDB-912). The redshift catalogs from Fields 2, 3, and 4 are named deep2-f2-redz-oii.fits, deep2-f3-redz-oii.fits, and deep2-f4-redz-oii.fits in the data directory.

- DEEP2 color selection: The file is a union list of DEEP2 photometric catalogs with an additional information regarding color selection. It was obtained from Jeff Newman and named color-selection.txt.

- survey-bricks-dr5.fits: A DECaLS survey product that contains information about each processed bricks. Downloaded using the below command.
wget http://portal.nersc.gov/project/cosmo/data/legacysurvey/dr5/survey-bricks-dr5.fits.gz

- DR5 Tractor catalogs in DEEP2 Fields 2, 3, and 4: Using the "survey-bricks-dr5.fits" file above and a short script (generate-tractor-download-scripts.py), the below download scripts were generated:
tractor-download-d2f2.sh
tractor-download-d2f3.sh
tractor-download-d2f4.sh

- Tycho-2 catalog: The file was obtained from one of the DESI collaborators though I cannot remember whom. Named tycho2.fits in the data repostiory.

- spitzer catalogs: (add further description)





/---- Data pre-processing
After downloading/acquiring the DECaLS DR5 and DEEP2 catalogs, they are pre-processed into convenient forms as described in this section.

Combine all downloaded Tractor files by DEEP2 Field
- Notes: Combine all Tractor files by DEEP2 Field, append Tycho-2 stellar mask column, and mask objects using DEEP2 window funtions. Note that no other catalog quality cuts were made at this point. Tycho-2 masks were appended but not imposed as may be altered in the future.
- Script used: combine-Tractor-bricks.py
- Products:
	- DR5-Tractor-D2f*.fits

Combine DEEP2 catalogs
- Notes: Combine photometric and redshift catalogs (by matching the two and appending select columns from redshift catalogs to photometric catalogs). DEEP2 window functions are applied and Tycho-2 mask column is applied to the objects. For more information, please see produce-deep2-f234-photo-redz-oii-prints.txt. Refer to generate_class_col function for the exact classification scheme used.
- Script used: produce-deep2-f234-photo-redz-oii.py
- Products: 
	- deep2-f**-photo-trimmed.fits: Intermediate products after applying DEEP2 window mask and BADFLAG quality cut. Consult the code.
	- deep2-f**-photo-redz-oii.fits: Final products of the combined DEEP2 photometric and redshift catalogs.

Combine DEEP2 and DR5
- Notes: We make the following cuts prior to cross matching.
	- Common: Tycho masks and window functions.
	- DR5: g < 24.25, grz_allmask == 0, grz_ivar > 0, brick_primary == 0. grz_allmask might be too stringent but this is done for convenience. This condition may be loosened down the road. We match DEEP2 catalogs onto DR5. Please refer to combine-Tractor-DEEP2-prints.txt
- Script used: combine-Tractor-DEEP2.py
- Products:
	- unmatched-deep2-f**-photo-redz-oii.fits: A subset of DEEP2 combined catalogs that did not find matches in DR5.
	- DR5-matched-to-DEEP2-f**-glim24p25.fits: Original DR5 combined catalogs with DEEP2 information appended.

Combine DEEP2 and DR5 -- No Mask version
- Notes: The same as above except I do not apply any Tractor mask or Tycho mask except for DEEP2 spectrosopic window function. Please refer to combine-Tractor-DEEP2-prints-NoMask.txt.
- Script used: combine-Tractor-DEEP2-NoMask.py
- Products:
	- unmatched-deep2-f**-photo-redz-oii-NoMask.fits: (Not saved)
	- DR5-matched-to-DEEP2-f**-glim24p25-NoMask.fits

ConvNet examples
- Notes: This script select DR5 objects that will be used as training examples for a ConvNet whose purpose is to reject "bad" objects based on images. The rationale for excluding DEEP2 unobserved objects but including unmatched objects is that we want to understand why certain DR5 objects were unmatched in DEEP2 whereas if an object is matched in DEEP2 but was previously unobserved we can assume that those objects can be characterized by the included set. Objects that have "too close" to low redshift loci are excluded. 
- Script used: produce-ConvNet-examples.py
- Products:
	- DR5-matched-to-DEEP2-glim24p25-NoMask-ConvNet-examples.fits

Estimate the intersection set area (DEEP2 window function AND Tycho mask)
- Notes: Monte Carlo estimation of the intersection set area. Throw random points in the footprint and cross-match.
- Script used: estimate-spec-area-DR5-matched.py
- Products: 
	- spec-area.npy: Areas of the intersection set: [0.70446343918138887, 0.85027729873967495, 0.57165023559554096] 2.12639097352
- Figures:
	- estimate-area-monte-carlo-DR5-matched.png: Shows the matched vs. unmatched MC randoms.

Definition of asinh magnitude
- Notes: Throughout this work, we use asinh magnitude which is defined by a single parameter for each band which is comparable to typical 1-sig error.
- Script used: define-asinh-mag.py
- Results: b parameter for g, r, z, and oii.
	g 0.0285114
	r 0.0423106
	z 0.122092
	oii 0.581528277909





/---- Data quality checks
After pre-processing the data into convenient forms, we perform sanity checks on the data quality.

! RADEC plots of DEEP2 and DR5 (all, matched, unmatched)
- Notes: 3 x 3 RA/DEC plots. Columns refer to Fields and rows to (all, matched, unmatched) in order.
- Script used: plot-RADEC-All-Matched-Unmatched.py
- Product: 
	- RADEC-DEEP2-All-Matched-Unmatched.png
	- RADEC-DR5-All-Matched-Unmatched.png

DEEP2 raw and weighted number comparison
- Notes: Based on the combined DEEP2 photometric and redshift catalogs, the weighted total number using DEEP2 target probability must be comparable to the raw total number in each Field. In each Field, the weighted number is greater than the raw total density by ~5%. Also, there are about 20% more raw objects in Field 4 compared to Field 2 (see DEEP2-make-table-prints.txt).
- Script used: DEEP2-make-table.py
- Product:
	- DEEP2-make-table-prints.txt

Intersection set raw and weighted number comparison
- Notes: Based on the intersection set, the weighted total number using DEEP2 target probability must be comparable to the raw total number in each Field. There are about 20-30% more objects in Field 4 compared to Field 2 (see Intersection-make-table-prints.txt). There are about 10-20% more objects in the weighted. This means that a selection based on the number density model trained on weighted data set may be too conservative and return fewer than expected number of objects. In order to take into account the number of unmatched objects (about 6-8%), we simply scale the intersection set by the fraction of unmathced objects. (Note that this does not solve the problem of weighted vs. raw density.)
- Script used: Intersection-make-table.py
- Product:
	- Intersection-make-table-prints.txt	

DEEP2 ELG properties comparison by Field
- Notes: Based on the combined DEEP2 photometric and redshift catalogs, the target probability weighted properties of ELG in the three Fields should be comparable. Compare the following quantities: number density, redshift distribution, OII distribution, BRI-magnitude-color distribution.
- Script used: DEEP2-ELG-properties.py
- Product:
	- DEEP2-redz-oii-hist.png
	- DEEP2-BRI-hist.png

DECaLS DR5 number density comparison by Field
- Notes: Based on combined DECaLS DR5 catalogs only, the following quantities should be comparable given g < 24.25: grz-magnitude distribution and total number density.
- Script used: DR5-properties.py
- Products:
	- DR5-grz-hist.png
	- DR5-mag-depths.png
	- depths-all-objects.png: grz magnitude 5-sig depths computed from the Tractor quoted errors.
	- depths-rexp.png: grz magnitude 5-sig depths of r_exp [0.35, 0.55] objects.

DECaLS DR5 number density comparison to external calibration dataset.
- Notes: We plot area-normalized g-mag histograms for external calibration dataset (DR5) and DEEP2 F234 Tractor data. We find that there are substantial deficit in Field 2 and excess in Field 3 and 4.
- Script used: Tractor-g-hist-comparison-cal-DEEP2.py
- Produts:
	- g-hist-comparison-F{2..4}.png

DEEP2-DECaLS DR5 intersection set (g < 24.25)
- Notes: The properties of the intersection set (only DEEP2 matched DR5 objects) in different Fields should be comparable to each other: grz-magnitude distribution (weighted AND unweighted), grz-color distributions (division by class), redshift and OII distribution (ELG). 
- Script used: Intersection-properties.py
- Products:
	- Intersection-grz-hist-raw.png
	- Intersection-grz-hist-weighted.png
	- Intersection-grz-color-by-class-field.png: raw only with g > 21.
	- Intersection-grz-color-by-class.png: raw only with g > 21.	
	- Intersection-redz-oii-hist-ELG.png: weighted only

DECaLS DR5 unmatched to DEEP2 set
- Notes: The generated figures are used to confirm visually that the objects in the unmatched set has the same grz-color-magnitude distribution as that of the matched set.
- Scipt used: DR5-matched-vs-unmatched-properties.py
- Products:
	- DR5-matched-vs-unmatched-grz-hist-normed-by-field.png
	- DR5-matched-vs-unmatched-grz-color-by-field.png

! DEEP2-DECaLS DR5 intersection set (g < 24.25) ELG selection challenge plots
- Notes: These plots are used to illustrate the challenge of ELG target selection problem.





! /---- MMT observational study scripts/data archive
These files/data are included to document the analysis that was performed. The code may not be up to date.

The following scripts were used to process the MMT study data:
append-DR1-MMT-DR3.py	process-MMT-16hr-1.py	process-MMT-16hr-2.py	process-MMT-23hr.py	produce-DR1-MMT-DR3.py

The original data files are stored in ../data/MMT_data/

Any dervied files can be found in ../data/derived/





/---- Application
In this section, I document exactly how all the figures and results reported in the NDM paper are produced. I include additional scripts whose results are not presented in the paper.

All relevant convenience functions are collected in utils.py and any integral facilities to the NDM method are included in DESI_NDM class module in NDM_models.py.

DESI_NDM class module is able to perform the following tasks:
- Given training data, perform intrinsic number density fitting. This includes broken power law fititng and color distribution fitting for all five classes defined in the paper. 
- Generate and retain MC samples from the intrinsic densities.
- Convolve the intrinsic densities with noise models to generate the observed densities.
- Define a selection region given a desired total number density. (Implement the two regularizing scheme: smoothing and magnitude dependent cruds.) Given grz-fluxes of an external catalog, return the NDM selection as a vector.
- Provides a prediction for the properties of the selection including grz and OII flux distributions and the redshift distribution.

To keep DESI_NDM as a light-weight class, I made a conscious effort to collect any convenience functions to utils.py.

Intrinsic density fitting
- Notes: The script documents how the DESI_NDM class module is used to fit intrinsic densitiy models to the training data. For the color distributions, following the model fitting with various number of component gaussians, I visually inspect the fits and pick a model with the lesat number of components that can adequately reproduce the data.
- Script used: fit-intrinsic-densities.py
- Products: ./data/derived/
	- broken-power-law-params-cn*.npy: Broken power law params. 	
	- Intersection-dNdm-by-class.png: Plot of dNdm by class and their fits.
	- MoG-params-cn*-K*.npy: MoG params. K denotes number of components. 
	- Intersection-colors-cn*-K*-pair*.png: Plot of color distributions and their fits with varying number of components.
		- For each pair of colors include 
			1) scatter plot of data and MC draws.
			2) Histogram of data and MC draws.
		- This means for Gold and Silver, we need three panels corrresponding to each pair. 
		Pair number
			- 0: g-z vs. g-r
			- 1: g-z vs. g-oii
			- 2: g-r vs. g-oii

Fiducial DR5 selection
- Notes: The script generates fiducial selection with parameters given in the NDM paper. 
- Script used:
	- run-NDM-fiducial.py
- Product: 
	- run-NDM-fiducial-results.txt
	- /scripts/DR5-NDM1/ contains all results inclindg boundary figures, cell_select grid, and marginal efficiency arrays.

Utility metric variation
- Notes: Fiducial parameters except U_Gold = 5 and U_NoZ = 1.
- Script used:
	- run-NDM-util-var.py
- Product: 
	- run-NDM-util-var-results.txt
	- /scripts/DR5-NDM2/: Corresopnds to U_Gold = 5 case.
	- /scripts/DR5-NDM3/

Selection boundary figures
- Notes: For certain selection regions, we use the below script to produce figures to go into the paper. In particular the figure show samples from the calibration dataset over plotted so that the reader can get a sense of how many objects are within the selection region.
- Script used: NDM-fudical-selection-plots.py
- Products: ./figures/fiducial-boundary-*.png

dNdm and dNdz for the fiducial and U_Gold = 5.
- Notes: We make dNdm and dNdz plots to compare fiducial and U_Gold = 5 cases. dNdm is based on the external calibration dataset and dNdz is based on DEEP2 dataset only.
- Scripts used: 
	- NDM-dNdz-fiducial-Ugold5.py
	- NDM-dNdm-fiducial-Ugold5.py
- Products:
	- NDM-dNdz-fiducial-Ugold5-DEEP2F**.png
	- NDM-dNdm-fiducial-Ugold5-calibration.png

One vs. two priority classes
- Notes: A back of the envelop calculation based on the fiducial selection. Compares one vs. two priority system cases.
- Script used: NDM-fiducial-one-vs-two-priority.py
- Products: NDM-fiducial-one-vs-two-priority-results.txt

Apply FDR to DEEP2
- Notes: Script applies FDR cut to DEEP2 data.
- Script used: apply-FDR-to-DEEP2.py
- Product:
	- apply-FDR-to-DEEP2-results.txt

Depths variation:
- Notes: We explore the effects of varying the noise model while keeping the fiducial selection at N_desired = 2400. If the data is noisier than the resulting selection will have poorer efficiency.
- Script used: run-NDM-depths-var.py
- Product: 
	- ./DR5-NDM4/eff_arr.npy
	- ./figures/NDM-depths-var.png

Marginal efficiency plots for fiducial and U_Gold = 5 cases. 
- Script used: marginal_eff_plots_fiducial_Ugold5.py
- Product:
	- ./DR5-NDM1/marginal-eff.png
	- ./DR5-NDM2/marginal-eff.png

Calibration files:
- DR5: From previous work. Fluxes were properly de-reddened. Refer to DR5-large-Nexp2-area.py. Though the code is not supported, archived here for the record.

Notes on marginal efficiency:
- Each cell of the grid has an efficiency appxoriamted as the conditional probability of good (desirable for DESI) objects. 
- Each discrete NDM solution gives an ordering of these cells, which then defines marginal efficiency as the efficiency of the last included cell given a desired threshold number.
- To compute marginal efficiency, we sum the total number of good objects and divided by the total. To do this we use external calibration data and the conditional probability we compute (via histogramming).

Notes on color term correction:
- To apply the selections, color term corrections should be applied to the intrinsic MC samples and the calibration dataset. Note we shift DR5 fluxes to DR4/6 fluxes based on the correction quoted by Anand, DESI workshop presentation (2017).

DR5-NDM-Obiwan-samples
- Notes: These samples were genereated as a part of DEEP2 Field 2 incompleteness study. Kaylan will inject these samples to real images and run Tractor on them. Afterwards, we can search for any systematic biases that may have resulted in the incompleteness we observed.
- Script used: 
	- generate-NDM-Obiwan-samples.py
- Products: All figures and files are saved in the folder "/scripts/DR5-NDM-Obiwan-samples/"







/---- List of challanges
Here I make a list of unresolved problems and challenges that should be addressed in future work.

Data-related challenges
- DEEP2 survey had R < 24.1 cut whereas we impose g < 24.25 cut to DESI imaging catalogs. In the project, I assumed that the properties of DECaLS objects not matched in DEEP2 can be extrapolated based on the matched set without any modification. This may not be true.
- There are discrepancies between the raw and weighted number densities. For the intersection set, this is as large as 10-20%.
- Target probability calculation and OII flux measurements were never checked independently by me.
- Cataloging incompleteness is observed in DECaLS catalogs when comparing DEEP2 F2 and F34. The incompleteness issue should be more systematically studied.

Modeling-related challenges
- The modeling approach presented in the paper is Frequentist. A fully Bayesian version might be desirable to eliminate overfitting behavior.
- So far we have been using normal flux errors according to some limiting magnitudes. We may want to use error models derived from forward modeling simulation study done with Obiwan code.
- Selection region convergence: Insufficient MC convolution and sampling in general may leave noisy features. Would it be possible to use random grid to solve this problem?
- Efficient representation of selection region: A selection region represented by a set of cells in a chosen color-color-magnitude space may be more efficiently represented through other means such as tesselation.
- Redshift modeling: Due to cosmic variance, redshift distribution is not explicitly modeled. 

Post SV-period modeling
- During Science Validation period, DESI will gain 500K additional spectra. How should the selection region be updated to incorporate information gained from these observations?

About

Number Density Modelling Algorithm for DESI Target Selection

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published