PhD Candidate, Stanford University 2017 - present
Computer Science Graphics Lab. Advised by Kayvon Fatahalian.
Computer Science Graphics Lab. Advised by Kayvon Fatahalian.
Computer Science Theory Track.
Computer Science Systems Track. With Honors & Distinction.
How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing the stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity and its images serve as pseudo-labels for a good crop, while the text-image diffusion model is used to out-paint (i.e., outward in-painting) realistic, uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics.
We introduce the task of spotting temporally precise, fine-grained events in video (detecting the precise moment in time events occur). Precise spotting requires models to reason globally about the full-time scale of actions and locally to identify subtle frame-to-frame appearance and motion differences that identify events during these actions. Surprisingly, we find that top performing solutions to prior video understanding tasks such as action detection and segmentation do not simultaneously meet both requirements.
In response, we propose E2E-Spot, a compact, end-to-end model that performs well on the precise spotting task and can be trained quickly on a single GPU. We demonstrate that E2E-Spot significantly outperforms recent baselines adapted from the video action detection, segmentation, and spotting literature to the precise spotting task. Finally, we contribute new annotations and splits to several fine-grained sports action datasets to make these datasets suitable for future work on precise spotting.
James Hong, Haotian Zhang, Michaël Gharbi, Matthew Fisher, and Kayvon Fatahalian. In proceedings of the European Conference on Computer Vision (ECCV), 2022.
Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur and occlusions. This leads to poor accuracy when downstream tasks, such as action recognition, depend on pose. End-to-end learning circumvents pose, but requires more labels to generalize.
We introduce Video Pose Distillation (VPD), a weakly-supervised technique to learn features for new video domains, such as individual sports that challenge pose estimation. Under VPD, a student network learns to extract robust pose features from RGB frames in the sports video, such that, whenever pose is considered reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds, exploiting the rich visual patterns in raw video frames, while learning features that agree with the athletes' pose and motion in the target video domain to avoid over-fitting to patterns unrelated to athletes' motion.
VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets, without requiring additional ground-truth pose annotations.
James Hong, Matthew Fisher, Michaël Gharbi, and Kayvon Fatahalian. In proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
Cable (TV) news reaches millions of US households each day. News stakeholders such as communications researchers, journalists, and media monitoring organizations are interested in the visual content of cable news, especially who is on-screen. Manual analysis, however, is labor intensive and limits the size of prior studies.
We conduct a large-scale, quantitative analysis of the faces in a decade of cable news, from the top three US cable news networks (CNN, FOX, and MSNBC) spanning January 2010 to July 2019 and totaling 244,038 hours of video. Our work uses technologies such as automatic face and gender recognition to measure the ''screen time'' of faces and enable visual analysis and exploration at scale. Our analysis method gives insight into a broad set of socially relevant topics. For instance, male-presenting faces receive much more screen time than female-presenting faces (2.4x in 2010, 1.9x in 2019).
To make our dataset and annotations accessible, we release a public interface that allows anyone to write queries and to perform their own analyses.
James Hong, Will Crichton, Haotian Zhang, Daniel Y. Fu, Jacob Ritchie, Jeremy Barenholtz, Ben Hannel, Xinwei Yao, Michaela Murray, Geraldine Moriba, Maneesh Agrawala, and Kayvon Fatahalian. In proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2021.
We describe the results of a randomized controlled trial of video-streaming algorithms for bitrate selection and network prediction. Over the last year, we have streamed 38.6 years of video to 63,508 users across the Internet. Sessions are randomized in blinded fashion among algorithms.
We found that in this real-world setting, it is difficult for sophisticated or machine-learned control schemes to outperform a "simple" scheme (buffer-based control), notwithstanding good performance in network emulators or simulators. We performed a statistical analysis and found that the heavy-tailed nature of network and user behavior, as well as the challenges of emulating diverse Internet paths during training, present obstacles for learned algorithms in this setting.
We then developed an ABR algorithm that robustly outperformed other schemes, by leveraging data from its deployment and limiting the scope of machine learning only to making predictions that can be checked soon after. The system uses supervised learning in situ, with data from the real deployment environment, to train a probabilistic predictor of upcoming chunk transmission times. This module then informs a classical control policy (model predictive control).
To support further investigation, we are publishing an archive of data and results each week, and will open our ongoing study to the community. We welcome other researchers to use this platform to develop and validate new algorithms for bitrate selection, network prediction, and congestion control.
USENIX NSDI Community Award
IRTF Applied Networking Research Prize
Francis Y. Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis, and Keith Winstein. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020.
The Internet of Things (IoT) is changing the way we interact with everyday objects in the world around us. However, IoT devices are also notorious for ther lax security and the vulnerabilities that they introduce into our computing networks.
This paper describes Bark, a system for specifying and enforcing default-off access control in home IoT networks. Bark phrases access control policies in terms of natural questions (who, what, where, when, and how) and transforms them into transparently enforceable rules for IoT devices. Bark can express detailed rules such as “Let the lights see the luminosity of the bedroom sensor at any time” and “Let a device at my front door, if I approve it, unlock my smart lock for 30 seconds” in a way that is presentable to users.
James Hong, Amit Levy, Laurynas Riliskis, and Philip Levis. In Proceedings of the 3rd ACM/IEEE International Conference on Internet of Things Design and Implementation (IoTDI), 2018.
Careful resource monitoring is necessary to understand usage patterns and set conservation goals in an institutional setting. Sensor systems provide data to measure consumption and evaluate the effectiveness of active interventions. However, deploying sensing systems can be difficult when infrastructure support is limited. This paper describes the process of designing Tethys, a wireless water flow sensor that collects data at per-fixture granularity without dependence on existing infrastructure and trusted gateways. Rather than rely on electrical infrastructure, Tethys implements energy harvesting to allow for long term deployment. To avoid dependence on existing network infrastructure, Tethys crowdsources the data collection process to residents’ smartphones acting as gateways. These gateways are untrusted and unreliable, so Tethys implements end-to-end reliability and security between the sensing device and a cloud backend.
Holly Chiang, James Hong, Kevin Kiningham, Laurynas Riliskis, Philip Levis, and Mark Horowitz. In Proceedings of the 3rd ACM/IEEE International Conference on Internet of Things Design and Implementation (IoTDI), 2018.
The next generation of computing peripherals will be low-power ubiquitous computing devices such as door locks, smart watches, and heart rate monitors. Bluetooth Low Energy is a primary protocol for connecting such peripherals to mobile and gateway devices. Current operating system support for Bluetooth Low Energy forces peripherals into vertical application silos. As a result, simple, intuitive applications such as opening a door with a smart watch or simultaneously logging and viewing heart rate data are impossible. We present Beetle, a new hardware interface that virtualizes peripherals at the application layer, allowing safe access by multiple programs without requiring the operating system to understand hardware functionality, fine-grained access control to peripheral device resources, and transparent access to peripherals connected over the network. We describe a series of novel applications that are impossible with existing abstractions but simple to implement with Beetle.
Amit Levy, James Hong, Laurynas Riliskis, Philip Levis, and Keith Winstein. In Proceedings of the 14th International Conference on Mobile Systems, Applications and Services (MobiSys), 2016.
The ability to modify images to add new objects into a scene stands to be a powerful image editing control, but is currently not robustly supported by existing diffusion-based image editing methods. We design a two-step method for inserting objects of a given class into images that first predicts where the object is likely to go in the image and, then, realistically inpaints the object at this location. The central challenge of our approach is predicting where an object should go in a scene, given only an image of the scene. We learn a prediction model entirely from synthetic data by using diffusion-based image out-painting to hallucinate novel images of scenes surrounding a given object. We demonstrate that this weakly supervised approach, which requires no human labels at all, is able to generate more realistic object addition image edits than prior text-controlled diffusion-based approaches. We also demonstrate that, for a limited set of object categories, our learned object placement prediction model, despite being trained entirely on generated data, makes more accurate object placements than prior state-of-the-art models for object placement that were trained on a large, manually annotated dataset.
Lu Yuan, James Hong, Vishnu Sarukkai, and Kayvon Fatahalian. SytheticData4ML Workshop, NeurIPS 2023.
Many real-world video analysis applications require the ability to identify domain-specific events in video, such as interviews and commercials in TV news broadcasts, or action sequences in film. Unfortunately, pre-trained models to detect all the events of interest in video may not exist, and training new models from scratch can be costly and labor-intensive. In this paper, we explore the utility of specifying new events in video in a more traditional manner: by writing queries that compose outputs of existing, pre-trained models. To write these queries, we have developed Rekall, a library that exposes a data model and programming model for compositional video event specification. Rekall represents video annotations from different sources (object detectors, transcripts, etc.) as spatiotemporal labels associated with continuous volumes of spacetime in a video, and provides operators for composing labels into queries that model new video events. We demonstrate the use of Rekall in analyzing video from cable TV news broadcasts, films, static-camera vehicular video streams, and commercial autonomous vehicle logs. In these efforts, domain experts were able to quickly (in a few hours to a day) author queries that enabled the accurate detection of new events (on par with, and in some cases much more accurate than, learned approaches) and to rapidly retrieve video clips for human-in-the-loop tasks such as video content curation and training data curation. Finally, in a user study, novice users of Rekall were able to author queries to retrieve new events in video given just one hour of query development time.
Dan Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong, Avanika Narayan, Maneesh Agrawala, Christopher Ré, Kayvon Fatahalian. Workshop on AI Systems, SOSP 2019
Embedded sensor networks often consist of a 3-tier architecture consisting of embedded nodes, gateways that connect an embedded network to the wider Internet, and data services in servers or the cloud. Yet, IoT applications are developed for each tier separately. Consequently, the developer needs to amalgamate these distinct applications together.
This paper proposes a novel approach for programming applications across 3-tiers using a distributed extension of the Model-View-Controller architecture. We add new primitive: a space - that contains properties and implementation of a particular tier. Writing applications in this architecture affords numerous advantages: automatic model synchronization, data transport, and energy efficiency.
Laurynas Riliskis, James Hong, and Philip Levis. In Proceedings of the International Workshop on Internet of Things towards Applications (IoT-App), 2015.
Aut `22, Aut `23
Aut `16, Aut `17
Spr `17, Win `17
Win `16, Sum `18
Computer Science, Graphics Lab
Creative Intelligence Lab, San Francisco
Data analytics infrastructure team
Experimentation platform team
Click on the thumbnails to view.