Intro

Conference Stats

2k+ papers accepted
per category:

My experience

I mostly looked on Diffusion/GAN-related, architectures, training tricks, a little bit few/zero-shot, segmentation/detection.
Overall ~110+ papers are covered (more than 5% of the conference) + some workshops + sometimes you can find random ideas
Overall conference felt much less useful than ICCV2019 which I attentded in person previously (notes). Maybe the amount of interesting simple ideas is somewhat more depleted? Or the things I was interested in became more narrow? Or I became ~~older and more stup~~ more experienced? On most poster sessions 1 hour was sufficient to check all useful papers (if you don't want to weight 10-15 mins/poster to speak to author). Anyway, conference is in big part for socializing, and research is already dead outdated so it was not bad.
Some posters were missing! or at the wrong place! some others were on workshops as well as on main conference or 2 consequtive days/poster sessions in 1 place... my strategy was just walking over all the posters, problem is some posters are added late, some are removed early - so you never know you checked everything. 1 time I tried to look specifically for 1 poster, checked it's allocated place as well as all posters in general - didn't found any... after which I stopped using the schedule as guidance

If you're reading this for some weird reason and you're not me

Recommended order is

Main Insights
Workshops and Keynotes - relatively short sections but they can provide some high-level interesting ideas
Paper description format
Check papers sections you're interested in (see in table of contents)
Maybe search for papers I rated highest: 9/10, 8/10 (usually they have comment on why)
~~Do your own summary, star the repo, subscribe~~

Main Insights

Data is crucial (highest quality data)
- EMU from Meta is tuned on just 2k but extremely high quality images
- Dalle3 report is all about how important is text-image matching in the data
- Alyosha Efros's talk
- ... (every 2nd paper/talk)
- Obvious? Sure, any self-respected ML practitioner learns it in year1, but after hearing it so many time you feel it
Domain experts might be of great help
- Photography quality labelling (there're agencies who label/relabel smartphone cameras quality by many params, also in EMU domain experts helped to select best images, ...)
- Mentioned in DeepMind's keynote (in the context of "don't try to blindly apply your 'genious' methods, - consult will it make sense / what is needed / etc")
Multitask training reduces data requirements by level of magnitude
- and technically might be equivalent, i.e. does not damage the quality
- in 2017 or 2019 one of best papers was about ~"which tasks we can combine in multi-task training to improve quality of all"
- *that might be applicable for huge models though, not for tiny ones
Self-supervised training: be careful, ensure you don't do something stupid accidentally (e.g. with large batch of text-image pairs using all other pairs as negative examples leads to incorrect negatives)
DeepMind keynote on project selection
There's apparently a way of encrypted data inference (so google or whoever is your inference cloud don't know the data you're processing) - not that fast though

Paper description format

Disclaimer: the notes are biased. Also in many cases I spent very few time on paper so there might be some inaccuracies/mistakes.

[x/10] (paper main idea description) Paper title my commentary

(some images if idea looked interesting enough and can't be described with a few words and I wasn't lazy)

Ratings are the more the better. Rating is ~usability/novelty of the paper to me (read: "very biased"). You can probably Ctrl+F 9/10, 8/10, 7/10, etc

I mostly grouped papers by primary topic, but there're exceptions. e.g. if the only interesting thing in the paper for me was loss I'd put it to losses section regardless of the main topic.

Workshops

Video workshop

Black box interfaces (on ux)
- chat model is way more convenient for humans.
- some signals are way easier to provide not with text but image (ref, controlnet, etc)
- "A good conceptual model let's users predict how input controls affect the output"
- (just a good question, no great answers I remember) "Low retention rate of GenAI tools, what is missing?"
Video understanding
- (tldr) - we really need hierarchical models
- unsupervised seems to work better than supervised now
- (historical note) In 2008 was possible to recognize actions like running, sitting down in a car. quite impressive
- Vid2seq paper can produce dense captions
- Unsolved video understanding: long term understanding, embodied video understanding (predicting future, potential, likely, interesting, etc)
  - my thoughts:
    - long-term probably needs just some hierarchical model (like different levels of abstraction in summaries).
    - embodied understanding - just learn to predict the future (also should work great in combination with RL/robotics, curiosity, etc)
    - overall does not look that problematic, 1-2 years and we'll be there easy
- is scaling LLMs the answer? author provide we need 1000x more data/capacity for videos to scale it directly (which is actually just ~20 years in Moor's law). also llms does not capture 4d world complexity (so we need multimodal something).
AI films (~3-10min movies showcast)
- are very different
- in general artists do what before but in a new way, sometimes simpler
- (have not seen higher [than non-AI] quality works but there should be some)

Continual learning workshop

Still far from solved, catastrophic forgetting
minor ideas are to update teacher model with ema student or unfrozen batchnorms - works but not too good

Quo Vadis / State of Computer Vision

shortlist of best thoughts
- Over fitting is because of multiple epochs - let's train on infinite stream of data instead
- the word computer in "computer vision" is accidental (vision is central, computers not important in 100 years)
- New crisis (llms) - focus on creativity instead
Alyosha Efros's talk (fun to watch, mostly memes, main point - data is king, use good data)
Lana Lazebnik's talk (on modern science pace, no specific solutions mostly just sharing problems)
- my thoughts (esp after talking with many PhD students on conference) [speculations]:
  - there're 2 kinds of papers - important/fundamental/groundbraking (new problem introduced & solved, completely new level of quality achieved, conceptually new paradigm in solving problems) and incrimental (tiny incrimental improvements in quality, minor hyperparameter change study, datasets exploration)
  - the first type takes a lot of time, in many cases your ideas do not work at all, in some cases (~idea is on the surface) other people publish it faster than you can complete research
  - the second type can be done in really short time, even 1 week start to finish if you try hard. it's sort of not that useful but you'll get your publications/citations/whatever
  - I noticed most PhDs focus on type 1 and fail to publish or focus on type 2 and feel bad about it (or not)
  - probably reasonable strategy is to find balance, spend some time on incremental and some on foundational (split time in week or allocate few months for 1 and few for 2)
  - todo: write type 1 ideas not yet implemented (it was my final todo but I'm too lazy now, maybe will do if repo gets 25+ stars (which it safely won't, right?))
Antonio Torralba's talk (current LLM crisis -> upcoming CV/entire industry crisis -> what to do with it)
- great talk full of memes and still valuable
- basically several lessons from history
  - from most recent: before 2012 people has to know all classic computer vision/ML staff. and still nothing worked with good enough quality for practical problems. now you "stack more layers" and it works. is it bad? people feel old knowledge is useless? not really, many feel excited
  - the greeks theory of extramission (emission) theory as first model of vision
  - the original motivation of images/art is ~"to have wild animals at home w/o it being too dangerous so they step on you during sleep - so someone invented cave painting"
  - at some point in art there was artist who could do perfect realism (or photography). and at this moment some artist thought - what do we do now? important ideas are captured! and when Dali & co comes and draws abstract things and ideas which do not exist in real world. ~go back to original idea of having smth beautiful at home/be able to produce it
  - (comment for last slide^) author provided comparison of number of ~human cognition sensors/cells responsible for vision vs neurons in modeln deep learning - and it's still favorable towards human vision. still mostly a joke as for me but if someone finds natural system easy to build which does not require much training like human vision - that'd be interesting

Efficient networks

from big teacher select channels/layers via pruning/etc - still works
list of current ~sotas

Meta GenAI

emu
- filter, base model sample and filter again
- 16 channel ae important for quality
text-to-3d
- good conceptual slide, also you can think on other tasks (x-to-image(done)/3d/video/4d/model/etc)

Keynotes

Robotics training

LLMs can act like brains of robots (planning agents, etc), but also they can model different users and their different preferences and therefore be a REWARD model as well

Deepmind Research

I liked the part about problem selection - (how to) choose most impactful thing, generally applicable on other scales as well

Papers by topic

Diffusion

Image editing

[*] [8/10] (to edit real image - generate it & modification with self-attention allowed to view original image)MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing there're few similar attn-mapping methods + extensions to a1111, overall legit idea, works

[8/10] (q: how to paste 1 image to another with harmonization (sort of poisson blending problem). a: copy paste noise of inserted image to another noise, map attention masks to the respective locations within the pasted region) TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

[6/10] (image manipulation (single image + new prompt -> manipulated). metric to select denoising step for image2image SD automatically - argmax entropy of diffusion training loss on every step. distill edits to other network after that) Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation might be useful metric

LoRA/Adapters

[] [9/10] (database search by (image, edit description), e.g. (img of a train, "at night"). works by textual inversion to S tokens, but distilled (so any image can get it's token with inference-only)) Zero-Shot Composed Image Retrieval with Textual Inversion models & code released. This should have a lot of applications with relatively trivial modifications. Similar to IP-adapter, just a bit different application scenarious
[*] [8/10] (what other dimensions you can save in tiny weight part finetuning for big model? precision. so technically for personalized loras you can store them in 1-bit precision as they do in the paper w/o loss in quality) Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy what do they even train in 1-bit? +1/-1? for how many weights? technically if the claim is not exploited too much one can save e.g. per user checkpoints with great savings in memory
[*] [8/10] (diffusion model for faces with relighting) DiFaReli: Diffusion Face Relighting faces are reconstructed really well and indeed only lighting changes. maybe useful for other decompositions

[6/10] (how to add new modalities encoders to pretrained text2image models? basically you only need paired data of your new modalities and text-image, train small adapter from your modality encoder (can be frozen) for merging with text encoder output before all cross-attentions) GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation good to confirm that simple idea works
[6/10] (how to tune diffusion with few params - train gamma params (for attn activations and feed forward) - their benchmark showed 8x better quality and slightly more param efficiency than 8/16 rank loras) DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-efficient Fine-Tuning Probably can be used as adapter for sd controlnets as well
[6/10] (customization. encoder to text embedding from 1 image (+main object mask) + finetuned keys/values for SD attention + extra "local" attention (preserving spatial structure & masking) embedding preserving spatial structure & extra trainable keys/values. during training predicts main + extra tokens, the rest is abandoned on inference as non-object related. ) ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation results are not that impressive (e.g. check kitten). something is missing, but spatial attention from image embedding itself makes sense to me ("local mapping") as well as one-shot encoder

[5/10] (set of binary masks, each connected with word in text + prompt -> segmantation-conditioned generation. how: force attention mask for selected words for which exist binary mask to match binary mask via loss, upd z_t iteratively) Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models easier to train controlnet these days, you rarely need only single image edit. but maybe connecting such controlnet with text tokens to force attention may improve quality more

Enforce prompt matching

[9/10] (q: how to fix problem that some part of prompt is ignored? e.g. for frog in crown you get just frog. a: you need to fix attention (on finer steps it's disappearing despite initially being present; to do that they introduced losses which adjust z_t during generation)A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis

[8/10] (~textual inversion for exclusive sets of attributes, e.g. gender, skintone, etc by image references. but not with actual textual inversion but by clip embedding optimization similar to stylegan-nada) ITI-GEN: Inclusive Text-to-Image Generation you can generate "man with glasses" but you can't generate "man without glasses" (usually negative prompts don't guarantee that, esp if you generate thousands of images) so that work is useful for controllable generation

Other / better guidance

[*] [8/10] (better claffifier (not free) guidance - backprop to all noises consequitively from original image. super slow. but quality is better) End-to-End Diffusion Latent Optimization Improves Classifier Guidance should also work to any losses (segmentation, identity, etc) since explicit gradient is used. isn't this obvious idea though? too obvious even, I'm surprised clf-guidance was done w/o full denoising, only issue was gradient backward through huge network on all steps so they reformulate it as invertible diffusion here

[5/10] (cfg-like guidance in order to improve sampling quality (~details) - basically reinforces effect of attention vs no use of attention ~attn>threshold mask) Improving Sample Quality of Diffusion Models Using Self-Attention Guidance some sampling quality boost for marginal cost increase. code

Domain adaptation

[*] [7/10] (domain adaptation (for style) on few images: sample from style-specific noise distribution (vae projection mean/std for mean and covariance of diffusion noise instead of N(0, I)) -> finetune diffusion from that noise distribution for ~1k steps. results look good, paper says 50-200 imgs work, poster was ~10-15) Diffusion in Style

[6/10] (turns out 2 stochastic diffusion models, trained independantly, given same "seed" produce related images (!!! lol) -> in this work they are generating surprasingly good paired images from 2 models / edits by prompt modification from single model. results look good)A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance mostly interesting theory since there's no community interest -> wide adoption in these models for now

Removing/modifying concepts in pretrained diffusion

[8/10] (problem: want to change diffusion assumption on the prompt (e.g. messi -> playing basketball not football, roses -> are blue). solution: given original/edited prompt modify text cross-attn layers to give similar masks for prompt1 to prompt2 -> update these params of the model -> updated model always thinks new behaviour is correct since layers are fused) Editing Implicit Assumptions in Text-to-Image Diffusion Models aka TIME. that's probably better way to patch the model instead of just stripping it away from all knowledge like anti-dreambooth, etc
[6/10] (remove concept C by forcing model to produce same noise as ok concept C', e.g. "grumpy cat"->"cat". side-effect - can preserve individual concepts while removing combinations ("kids with guns"->"kids", but "kids" and "guns" separetely still works)Ablating Concepts in Text-to-Image Diffusion Models probably most practical and easy to use. although the one below ("Erasing Concepts from Diffusion Models") in theory preserves the knowledge of the concept, just does not generate it by prompt directly (which can be good as it keeps more knowledge and bad as... it keeps this knowledge which still can be tampered with other prompts)

[5/10] (see poster explanations. basically use frozen model & tuned one, in tuned one use cfg-like guidance to guide in opposite direction from frozen for selected concepts) Erasing Concepts from Diffusion Models looks like better idea compared to anti-dreambooth

[3/10] (protected images/styles -> tune dreambooth to predict noise on them) Anti-DreamBooth: Protecting Users from Personalized Text-to-image Synthesis erasing should not work like this - it's damaging the model. at least let the model produce some plausable images

Not just text2image

[*] [9/10] (joint image+segmentation map generation by reformulated noise distribution) Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis this is INSIGHTFUL paper. basically they do some math to show that joint generation is equivalent to separate. insight is that you need MUCH LESS data because you predict multiple things together. e.g. for generative models on videos, 3d, etc difficult problems (more difficult than just images) should be very helpful

[5/10] (paired dataset included) Generating Realistic Images from In-the-wild Sounds
[4/10] (text/image-to-text/image) Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
[4/10] (train unconditional diffusion + conditional (on 1 view, targets obtained throigh warp). 360 views by iterative inference) 3D-aware Image Generation using 2D Diffusion Models maybe useful proxy for something, but not very practical. quality of warped targets likely low

Security risks

[7/10] (add backdoor to TEXT ENCODER to poison ANY text2image model trained on that. cyrillic "o" is invisible even for humans attack) Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis most of current text2image models are based on clip, so if somehow official checkpoint will be hacked all the models will also get hacked. bad prompt filtering pipelines should probably check for such attacks now before inference though. now what is really interesting - maybe instead of injecting backdoors they're already there - what if someone can find "abirvalg" - some weird combination of tokens which activates backdoor mode. sort of like "Try to impersonate DAN" attack for LLMs but prompt has to be optimized. what the finding of the paper tells is that the found magic word would affect all models trained with such encoder

[6/10] (imagine you release model with invisible watermarking. if someone infers directly that model - you probably detect it reliably. if someone finetunes it - your watermarking is mostly useless. in this work they added extra loss for close params to also produce watermarked models, sort of GAN game) Towards Robust Model Watermark via Reducing Parametric Vulnerability

Faster inference

[*] [7/10] (problem: pretrained t2i model -> how to infer fast and not loose quality. solution: look for non-uniform steps + subnetwork -> define search space -> evolutionary search (w/o retraining, only inference) with FID eval -> good results) AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration results too good for such simple method... didn't check details but interesting how they not loose much quality by abandoning some parts of architecture

Unsorted

[7/10] (dataset for gt attribution via diffusion customization -> eval existing approaches -> tune clip/etc -> clip(generated)@clip(ref) -> estimate attribution. quality of attibution predictor trained is surprisingly good [at least high matches are super relevant images]) Evaluating Data Attribution for Text-to-Image Models only issue is that it just compares images, it does not know if it was in training or not. but e.g. artist compensation is possible based on that
[6/10] (sort of bayes decomposition and basis is discovered automatically. learned concepts can be combined) Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models the idea itself is cool, but if I understand coerrectly number of concepts is hyperparameter
[4/10] (argmin|eps_pred(img of class_i)-eps|) Your Diffusion Model is Secretly a Zero-Shot Classifier smart, but not very useful

Data

Investigations

[5/10] (findings - 1)prompts on average lead to India/US as most relevant 2)adding country into the prompt improves generation but not completely sufficient 3)dalle2 likely had much better filtration than SD because for (2) they have bigger improvement)Inspecting the Geographical Representativeness of Images from Text-to-Image Models that's important research but sometimes it's surprising how you can publish on top venues just investigating the data

[3/10] (some biases of models, ~light skin -> more feminine, etc) Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color

Dataset compression to N samples

[2/10] (models trained on data-distilled to few samples are overconfident -> need "calibration" (more reasonable logit distribution) -> some fixes suggested in this paper) Rethinking Data Distillation: Do Not Overlook Calibration since original problem (compressing dataset to 100 samples) is still not useful (quality is bad, generalization beyound cifar100 is unlikely), adjustments are also not helpful

Datasets

[*] [6/10] (dataset of 10k artifacts segmentation from various generative models - GANs/Diffusion. also trained segmentation & inpainting but not code yet) Perceptual Artifacts Localization for Image Synthesis Tasks probably useful but not sure about quality, esp on new types of images
[5/10] (dataset with 5k diverse photos from smartphones estimated by experts on quality metrics) An Image Quality Assessment Dataset for Portraits that's cvpr paper but company had a booth and advertised it. research-only license, terms probably tricky, idea to get such labelling from experts is worthy though. I talked with them a little bit - important thing for quality estimations is to recompute benchmarks (cameras becoming better and better so there's no "perfect" quality in gt) every 1-2 years at least. camera producers usually go to these agencies to measure their quality (ratings are open although I'm not sure customers actually visit such websites to check)

Synthetic labels

[4/10] (self-explanatory title. quality claimed to be ok on real data and sota on zero-shot approaches) DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models labelling is ofc superior if you have resources. probably same approach can be useful in some other problems
[4/10] (optimization-based refinement of mask from attention, tune diffusion & update mask. expensive, but claim sota zero-shot) Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models

Multi-modality

VLMs (VQA, captioning, zero-shot classification, etc)

[8/10] (~synth data from llms, learned classifier on clip text embedding when infer on image clip embedding. close to sota on captioning, vqa, destroyed sota on some unpopular tasks) I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision I think I've seen other similar work where there's extra adapter image embedding -> text embedding and it should work even better.
[5/10] (attempt to cure ood hallusinations on captioning for VLMs) Transferable Decoding with Visual Entities for Zero-Shot Image Captioning sophisticated bert-like method to avoid overfit and allow ood generalization. maybe it's a better idea to just have ood examples to bring it in-domain?

CLIP Training

[8/10] (equivariant here ~= text-image scores are proportial to actual relevance, i.e. 0.7 is meaningful. given 2 image-text pairs [semantically similar! ~= not too different] they design simple losses e.g. similarity(text_1, image_2)==similarity(text_2, image_1), see others below) Equivariant Similarity for Vision-Language Foundation Models labelling is even more crucial for such alignment training. can be combined with other clip optimizations in training (e.g. filtering "hard samples" which are technically valid pairs)

[7/10] (affinity mimicking = same distribution of text-image similarity on train batch, weight inheritance = choose teacher weights part) TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance smaller/faster clip models are useful by themselves when performance matters. side note: progressive distillation works better (e.g. 100%->25% capacity is worse than 100->50->25 for same training time)

Prompt tuning

[5/10] (prompt learning for classification with couple extra losses) Self-regulating Prompts: Foundational Model Adaptation without Forgetting

CLIP zero-shot quality improvements

[6/10] (llm generates prompts per every class need to be detected. q for llm: "what does [class_name] look like?". claims to be better than hand-designed (well it scales up easily true)) What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification there were some other work saying that writing "[random characters] [class_name]" gives better accuracy than LLM-designed ones (averaged among these random characters ofc)
[6/10] (learn N "style" text embeddings ~ "a S_i style of [object]" where object is dog/cat/etc classes. styles are not supervised on anything, only text encoder and no images are used. on top of learned style augmented prompts linear classifier is trained. in the end argmax clip mean "a s_i style of [class_name]". works very good) PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization again I remember work with image2text clip embedding adapter with similar idea on this conference - should work even better. does not lead to interesting text2image styles, but probably some modifications can help finding interesting ones automatically (although with visual feedback should be better)

CLIP Inference

[6/10] (draw red circle around smth -> see what clip predicts where among the proposed variants. man->criminal, woman->missing. works for landmarks/etc) What does CLIP know about a red circle? Visual prompt engineering for VLMs that's nice exploit. direct practical usage is unclear, though on the meta-level should be applicable to other models

CLIP Data/abilities

[9/10] (that's some unknown paper from conference) (0 shot better - for contrastive pretraining because of ambiguity and large batch might be good img-text negative pairs (i.e. not on batch diagonal), so instead they consider 3 similarities between img/text/ij - imgs, texts, img-texti for the right loss) very simple/obvious idea yet very helpful
[5/10] (just finetune clip on data with number or objects in captions) Teaching CLIP to Count to Ten
[3/10] (basically tuned clip on negative image-text pairs by adding "no" to prompt, e.g. "image of NO cat" with dog image) CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

GANs

Latent space manipulations

[7/10] (connect w to selected regions (hardset or segmentation) by adding loss for this during training) LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis likely not very practical, interesting attributes are heavily entangled (if you change one, another have to change). nice idea though

[4/10] (gan features + face parameters (expression, lighting, pose) -> learn mapping to "better" space. cherry picked examples in paper are better than styleflow) Conceptual and Hierarchical Latent Space Decomposition for Face Editing didn't check details
[3/10] (find latent directions in gans, ~stylegans) Householder Projector for Unsupervised Latent Semantics Discovery found directions look entangled as usual. didn't check details

Domain adaptation

[*] [7/10] (stylegan-nada improved. idea: instead of single text direction use multiple ones and match distributions of image & text directions. 1)find multiple directions ~close to original trg text embedding + most dissimilar from one another -> ~uniformly distributed some distance around the embedding of trg text 2)for image-image directions in training batch and text-text directions (original and all augmented) penalize both mean and covariance mismatch. see formulas below for details) Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations stylegan-nada is sort of simple and naive loss, there're many small changes you could propose to improve results (see e.g. this work for clip within loss)

[6/10] (photo->avatar. where avatar is parametrized model, i.e. hair/ear/eyebrow/etc params. train 2 unconditional generator (real faces & parameters of avatars) after which mapping between them. they had small paired dataset of 400 imgs, hand-crafted by artist on volunteer selfies) Cross-modal Latent Space Alignment for Image to Avatar Translation random thought: maybe it's possible to learn alignment between 2 different generators (like 2 face GANs -> learn mapping from first GAN latent space to second GAN latent space. only need some paired data)
[5/10] (gan finetuning to dissimilar domain. 2 ideas: 1)~regularize init vs tuned feature discribution ("smoothness") 2)multi-resolutioal patch D) Smoothness Similarity Regularization for Few-Shot GAN Adaptation results on 10 imgs are still crap but less crap (in more reasonable problems should help as well). haven't found comparison vs augmentations (stylegan2-ada, etc) - usually that helped a lot)

Unsorted

[6/10] (reason: statistical assumption of stylegan2 is too strong [the one that we do not need subtract mean in weight demodulation], which leads to some ood features [high values] ) Feature Proliferation -- the "Cancer" in StyleGAN and its Treatments. they can both detect [heuristic] and fix [rescaling affected areas] the issue. both are cheap, see formulas

Training improvements

Training

[6/10] (train bigger model from smaller) TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching sort of curriculum learning but for weights. might be a bit useful if you go to ridiculously large models (GPT5 or something)

[5/10] (lowres -> highres & increase strength of augs. downsample is done with cropping in freq domain [claim is that's more precise/justified]. training cost of huge model trained from scratch reduced ~20-30%) EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones it'd be more curious how it generalizes to LLM/other foundational models. likely useless for finetuning. training on lowres first is beyound obvious as well as increasing strength of augmentations, still maybe take practical tips from here. for augmentations they use RandAug with progressive strengths. frequency domain also can be explored more during training (e.g. more losses, etc).

[5/10] (7x faster imagenet training ~any model. stored 400 augs per image with 10model-ensemble average predictions. training code 2 lines change) Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement thank you. not really often common people need to train smth on imagenet from scratch but might be useful eventually

Finetuning/other task adaptation

[3/10] (continual learning - pretrain base transformer -> adapt per task with conv transforming attn weights) Exemplar-Free Continual Transformer with Convolutions this is also for class-incremental setup (task id is not known - which is ?stupid because in practice it's known always. well at least chatgpt didn't provide reasonable answer)

Decoding

[3/10] (autoregressive decoder with k tokens/inference step for image compression. show that predefined token sampling schedule perform as well or better than random (how's that not obvious though?)) M2T: Masking Transformers Twice for Faster Decoding no new insights

Losses

[7/10] (simple tasks but on very high resolution - try to make realtime. idea1: overparametrization (10 branches later result to single 5x5 conv on inference), idea2: lightweighted feature fusion (f1*f2+bias), idea3: outlier-aware loss (prevents blur of l2 loss)) SYENet: A Simple Yet Effective Network for Multiple Low-Level Vision Tasks with Real-Time Performance on Mobile Device results are not super impressive but maybe some of that can be useful (OOL/overparametrization)

[6/10] (1-img stylization preserving content. for content preservation patch contrastive loss from eps-predictor of diffusion [so full denoising / noise-aware models not required]) Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer quality is complete crap (which is strange, probably they should not do it in patch-contrastive way because features from different regions of images are often similar?) but general idea of using eps-predictor as sort-of perceptual similarity source/target looks legit, should be applicable for other problems (definetely applied somewhere though?)
[6/10] (forced classifier to attend the right place of the image and it improved quality and maybe reliability) Studying How to Efficiently and Effectively Guide Models with Explanations

Federated learning

[*] [8/10] (how to adapt classifier per every user to improve quality. user side setup: clf(model(img, prompt)) where prompt is just a few trainable params, model is frozen (e.g. pretrained foundational model), clf is local per-client classifier. server setup: base prompts, prompt generator network (user descriptor -> better user prompt). rough idea: when training starts baseline zero-shot performance already works somehow, every training step you don't have GT but use current inference prediction as GT to update the system. so every step on user side clf, prompt are updated and on backend side prompt generator is updated) Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation some variation for personalized text2image models?

Architectures

Tricks

[6/10] (heuristic to process only part of tokens in ViT - the important/difficult ones. in practice can be ~25% but depends on task) A-ViT: Adaptive Tokens for Efficient Vision Transformer

[5/10] (same trick with skipping most tokens, for videos) Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
[5/10] (~~abandon some tokens~~ -> focus attention on some tokens) Less is More: Focus Attention for Efficient DETR claims to have better quality/speed tradeoff than abandoning
[3/10] (normally early exit idea on easy samples does not work / performance degrade, their solution marginally improve it) Dynamic Perceiver for Efficient Visual Recognition mostly added to have ref for early exit approach and that it does not really work well

Attention

[9/10] (instead of memory-augmented attention [=learnable extra keys and values] they reuse keys and values from the previous N training samples. motivation is that this should better focus on individual samples instead of being beneficial for the entire dataset on average. note that this is actual memory - previous outputs/thoughts of the network. to not store too many memories they use k-means centers. memories are updated every N training batches with k-means again) With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning memory should be very useful for other generative tasks, maybe not the approach itself but idea at least. like hashgrid encodings in nerfs, some sort of memory for the network to be able to operate, not just extract everything from input & biases, that's very reasonable
[8/10] (better linear attention. linear attn has quality drop, so they investigate issues and fix them. Y=phi(Q)phi(K)V + depthwise(V) where phi(x)=||x||*(x^p)/||x^p|| where x^p is elementwise power p) FLatten Transformer: Vision Transformer using Focused Linear Attention

[6/10] (relu linear attention fixes with depthwise convs) EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction

[6/10] (attend with different level of detail, based on network own attn map. claims sota of the time with some margin) SG-Former: Self-guided Transformer with Evolving Token Reallocation the idea makes sense, not sure how efficient the implementation is though

[2/10] Gramian Attention Heads are Strong yet Efficient Vision Learners no benefit for now, theory + work on par

Modules/Layers

[7/10] (1d oriented convs with efficient cuda implementation -> quality ~same as 2d on some tasks -> receptive field is bigger) Convolutional Networks with Oriented 1D Kernels on-device efficiency is always questionable for new layers, but idea is interesting. maybe some combination of 2d & 1d-oriented is better. e.g. 1d-oriented with huge kernel sizes to allow good receptive field, but 1 time per block only. or might be interesting to predict orientation first (per-pixel) and when use it similar to attention (although not sure it will be efficient this way)

[6/10] (compression) Shortcut-V2V: Compression Framework for Video-to-Video Translation Based on Temporal Redundancy Reduction maybe one interesting idea - they use deformable convs architecture, might be suitable for other video tasks
[6/10] (similar to SE block with better accuracy claimed, no params for visual transformers: y = x/2+x.mean(channel_dim)/2. motivation: investigation vits found that they try to learn dense (~=close to uniform per tokens) attention, despite it's hard to learn due to high gradients in these areas. the proposed module is explicit parameter-free extreme dense attention (=uniform attention)) Scratching Visual Transformer's Back with Uniform Attention
[5/10] (~inception architecture but with attention) Scale-Aware Modulation Meet Transformer marginal improvements, likely speed downgrade

[5/10] (inside inverted bottleneck some efficient attn operator) Rethinking Mobile Block for Efficient Attention-based Models probably useful didn't check details
[4/10?] (multiple feature aggregation modules) Building Vision Transformers with Hierarchy Aware Feature Aggregation only gave moment look but marginal improvements, not fast implementation? (e.g. clustering is parameter-free but not instant)

Downsample/upsample

[5/10] (generalization of all pooling algorithms, trainable, better quality, ~slower, targeted for final pooling before classification, gives better attention maps) Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?

[5/10] (tiny resizer (downsample). claimed to be useful for classification/segmentation to work on high resolution) MULLER: Multilayer Laplacian Resizer for Vision on device passing large textures to gpu is expensive operation so not sure how useful it is in practice
[5/10] (light upsample for dense predictions (but not superres)) Learning to Upsample by Learning to Sample likely marginal improvements in real life if any
[4/10] (spectral space pooling, claims sota quality on classification/segmentation) SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation not sure it's actually fast
[4/10?] (flexible downsample with non-integer stride) FDViT: Improve the Hierarchical Architecture of Vision Transformer only gave moment look but 0.4 improvement is questionable, flexible downsample is unlikely fast as well

Misc architectures

[5/10] (DiT; unet->transformer leads to much more scaleble architecture (both up and down) & claims better FID with same flops)Scalable Diffusion Models with Transformers why meta/stability didn't use transformers in EMU/sdxl then? or any other work using transformer instead of unet? (tldr from Slavchat discussion - there's some proof that what is important is computation - it should give +- same quality regardless of architecture (with reasonable architectures). in theory the benefit of transformers is that they're more easily scaleble. Kudos to Michael, Vadim, Seva, Aleksandr, George)
[4/10] Masked Autoencoders Are Stronger Knowledge Distillers note: masked autoencoders = bert-like
[3/10] (hyperbolic space operations end2end first network) Poincare ResNet maybe interesting in 5 years

Encrypted inference

(faster relu in ectrypted space) AutoReP: Automatic ReLU Replacement for Fast Private Network Inference imagine you have hosting on cloud for the model but don't want google or whoever is maintaining this to know which data you sent/predictions obtained. so turns out there're networks which operate on encripted input and return encripted output. and ReLU in such representation is a bottleneck (~100x slower compared to convs lol). that's actually interesting alternative for infering user photos on backend with extra guarantees if needed (banks, governments, etc)
(NAS for privacy-inference mode aware vits) MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention

Video

Video Generation

(no SD training. first frame latent -> denoise to an extent -> warp with camera motions through time -> noise again per frame -> denoise completely. +for complete denoising modified self-attn to work for every frame on first frame keys/values) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators again old one

Video + Audio

[7/10] (audio->video, works on ~backgrounds) The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion just curious: can it generate youtube video loop footage from background music? if quality is good enough maybe that's practical, if no:(
[7/10] Video Background Music Generation: Dataset, Method and Evaluation

Video Editing

[*] [8/10] StableVideo: Text-driven Consistency-aware Diffusion Video Editing code improvements over text2live for atlas-based stylization, probably has potential (need further reading)
[3/10] (driving video + ref -> edit. based on gans so likely outdated) VidStyleODE Disentangled Video Editing via StyleGAN and NeuralODEs likely not very practical + outdated. clip consistency loss between close frames is interesting regularization but likely introduces some error on it's own

Video Stylization

[3/10] (video stylization. train depth-conditioned video generator with extra temporal blocks -> infer on real video depth + edit prompt) Runway GEN1 just one more reminder how outdated is iccv research...

Video Tagging

[4/10] Order-Prompted Tag Sequence Generation for Video Tagging

Misc

[7/10] (anime inbetweening [sketches]. new small dataset & code) Deep Geometrized Cartoon Line Inbetweening the dataset obtained in synthetic way (blender 3d models). that's important problem, which should be solved relatively easily, just need the data. my guess is anime studios are long working on it because it's the easiest and most straightforward thing in animation pipeline to be optimized. and it's sort of solved for video interpolation already so really - only need the data. this is the first work on the topic surprisingly
[6/10] Story Visualization by Online Text Augmentation with Context Memory I just like the problem. Quality is looking far from good for now
[6/10] (how humans selected names for colors? why the colors are chosen this way? this paper analyses natural world colors distribution and tries to go from few colors by splitting 1 color into 2 many times and comes to close to natural evolution. naming of colors is not considered as it is completely random but the colors themselves are relatively similar) Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer great idea, might be useful to predict future colors by extrapolation / colors of alien planets in games, etc
[4/10] (artistic text by diffusion) DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion this one is better imo
DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer exists
Guided Motion Diffusion for Controllable Human Motion Synthesis exists

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md

asmekal/iccv-2023-notes

Folders and files

Latest commit

History

Repository files navigation

Table of contents

Intro

Conference Stats

My experience

If you're reading this for some weird reason and you're not me

Main Insights

Paper description format

Workshops

Video workshop

Continual learning workshop

Quo Vadis / State of Computer Vision

Efficient networks

Meta GenAI

Keynotes

Papers by topic

Diffusion

Image editing

LoRA/Adapters

Enforce prompt matching

Other / better guidance

Domain adaptation

Removing/modifying concepts in pretrained diffusion

Not just text2image

Security risks

Faster inference

Unsorted

Data

Investigations

Dataset compression to N samples

Datasets

Synthetic labels

Multi-modality

VLMs (VQA, captioning, zero-shot classification, etc)

CLIP Training

Prompt tuning

CLIP zero-shot quality improvements

CLIP Inference

CLIP Data/abilities

GANs

Latent space manipulations

Domain adaptation

Unsorted

Training improvements

Training

Finetuning/other task adaptation

Decoding

Losses

Federated learning

Architectures

Tricks

Attention

Modules/Layers

Downsample/upsample

Misc architectures

Encrypted inference

Video

Video Generation

Video + Audio

Video Editing

Video Stylization

Video Tagging

Other problems

Vector graphics

Style transfer

3D

image2image

Inpainting

Face Recognition

Hair simulation/editing/animation

Detection

Adaptation to the unknown

Misc

About

Topics

Resources

Packages