Data Analytics in Security and Privacy

August 25-27, 2021

Attend: Zoom Link (You may need to authenticate through your institution’s Zoom account before joining)
Watch Workshop Videos: YouTube Channel (From our YouTube Channel select the “Live Now” stream)


Data science has emerged as a powerful set of tools used to understand and predict the behavior of complex systems. It is also being actively explored in the cyber-security context, in applications ranging from analyzing billions of Internet-connected devices for evidence of having patched dangerous vulnerabilities, to data-centric approaches to intrusion detection, to machine-learning-based descriptions of normal system behavior, to attacks on the ML-based models and the learning process itself, to the revelation of private information in anonymized data records. This workshop aims to bring together those interested in better understanding the intersection of data science and cyber-security and privacy. Speakers include leaders conducting cutting-edge research, industrial practitioners, and government officials.


Organizers: Bo Li (University of Illinois at Urbana-Champaign), David Nicol (University of Illinois at Urbana-Champaign), Sean Peisert (Lawrence Berkeley National Laboratory, University of California, Davis)


Speakers: Somesh Jha (University of Wisconsin), George Kesidis (Pennsylvania State University), Zico Kolter (Carnegie Mellon University), Aleksander Madry (Massachusetts Institute of Technology), Patrick Drew McDaniel (Pennsylvania State University), Franzi Roesner (University of Washington), Stefan Savage (University of California, San Diego), Zach Tudor (Idaho National Laboratory), Mayank Varia (Boston University), Ce Zhang (ETH Zurich)

(All times are Pacific Time)

Day 1 (Wednesday, August 25)

11:00 am – 11:15 am: Welcome and Opening Remarks, David Nicol (University of Illinois at Urbana-Champaign), R. Srikant (University of Illinois at Urbana-Champaign)

11:15 am – 11:45 am: Keynote Presentation: Data Analytics for Infrastructure Protection, Zach Tudor (Idaho National Laboratory)

Abstract: Artificial intelligence (AI) has substantially evolved in the last half century. Today, America’s national laboratories use machine learning and data analysis to address and solve complex energy and national security challenges. In this presentation, I will discuss how AI is woven throughout the Idaho National Laboratory’s national security portfolio and highlight several research projects focused on the resiliency of the nation’s infrastructure.

Speaker: Zach Tudor is the Associate Laboratory Director for Idaho National Laboratory’s National and Homeland Security directorate. In this role, he directs a staff of more than 800 people responsible for nearly $500 million in annual research and development funding focused on industrial cybersecurity, infrastructure resilience, nuclear nonproliferation, and materials science. The laboratory’s national security missions support major programs for the Departments of Energy, Defense, and Homeland Security. Previously, he served as Program Director in the Computer Science Laboratory at SRI International, where he acted as a management and technical resource for operational and research and development cybersecurity programs for government, intelligence, and commercial projects. He supported the Department of Homeland Security’s Cyber Security Division on projects including the Linking the Oil and Gas Industry to Improve Cybersecurity consortium and the Industrial Control System Joint Working Group. He was recently appointed to chair the board of directors for the International Information System Security Certification Consortium, (ISC)², and has served as a member of the Nuclear Regulatory Commission’s Nuclear Cyber Security Working Group and as Vice Chair of the Institute for Information Infrastructure Protection at George Washington University. A retired U.S. Navy Submarine Electronics Limited Duty Officer and Chief Data Systems Technician, he holds an M.S. in Information Systems from George Mason University, and his professional credentials include Certified Information Systems Security Professional, Certified Information Security Manager, and Certified Computer Professional.

11:45 am – 12:15 pm: How do People Interact with Mis/Disinformation on Social Media? Lessons from Quantitative and Qualitative Data, Franzi Roesner (University of Washington)

Abstract: Misinformation and disinformation online, and on social media in particular, have become topics of widespread concern. As part of any efforts to combat these phenomena, we must first understand what is happening today: how do people using social media assess and interact with potential mis/disinformation? In this talk, I will discuss our work seeking to shed light on this question for Facebook and Twitter users, leveraging both large-scale data and quantitative methods (analyzing a dataset describing millions of URLs shared publicly on Facebook in 2017-2019) as well as small-scale data and qualitative methods (conducting an in-depth observational study of Facebook and Twitter users). Our findings reveal substantial engagement on Facebook with known and potential mis/disinformation URLs, particularly among certain demographic groups, as well as the varied heuristics that people use when assessing and investigating mis/disinformation that they see on social media. I will reflect on the strengths and limitations of each of our methods, highlight open research questions, and discuss the implications of our findings for future work aiming to reduce the spread of mis/disinformation on social media.

Speaker: Franziska (Franzi) Roesner is an Associate Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, where she co-directs the Security and Privacy Research Lab. Her research focuses broadly on computer security and privacy for end users of existing and emerging technologies. Her work has studied topics including online tracking and advertising, security and privacy for sensitive user groups, security and privacy in emerging augmented reality (AR) and IoT platforms, and online mis/disinformation. She is the recipient of an MIT Technology Review “Innovators Under 35” Award, an Emerging Leader Alumni Award from the University of Texas at Austin, a Google Security and Privacy Research Award, and an NSF CAREER Award. She serves on the USENIX Security and USENIX Enigma Steering Committees. She received her PhD from the University of Washington in 2014 and her BS from UT Austin in 2008.

12:15 pm – 12:45 pm: Why Are Adversarial Samples So Easy To Find?, Patrick Drew McDaniel (Pennsylvania State University)

Abstract: Advances in machine learning have enabled new applications that border on science fiction. Autonomous cars, data analytics, adaptive communication and self-aware software systems are now revolutionizing markets by achieving or exceeding human performance. Yet, we have discovered that adversarial samples—inputs that induce underlying model failure—are often easy to find. In this talk, we review the structure of adversarial sample generation algorithms and explore the reason why they are so often effective through the prism of recent results. We will identify the definitions of security in ML (and reasons for the lack of community consensus on them) and will conclude with a brief discussion on the practical impact of model frailty on threat models and the future of ML-enabled systems.
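
To make the abstract's notion of an adversarial sample concrete, here is a minimal sketch of one well-known generation algorithm, the fast gradient sign method, applied to a toy logistic-regression model. The weights, input, and step size `eps` are made-up values for illustration, and nothing here is taken from the talk itself.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def fgsm(w, b, x, y, eps):
    """Shift every feature of x by eps in the direction that increases the loss.

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * si for xi, si in zip(x, sign)]


# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, 0.1, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.25)
clean_label = int(predict_proba(w, b, x) >= 0.5)
adv_label = int(predict_proba(w, b, x_adv) >= 0.5)
```

With these values the clean input is classified as 1 and the perturbed input as 0, even though no feature moved by more than 0.25.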

Speaker: Patrick Drew McDaniel is the William L. Weiss Professor of Information and Communications Technology and Director of the Institute for Networking and Security Research in the School of Electrical Engineering and Computer Science at the Pennsylvania State University. Professor McDaniel is also a Fellow of the IEEE and ACM and the director of the NSF Frontier Center for Trustworthy Machine Learning. He also served as the program manager and lead scientist for the Army Research Laboratory’s Cyber-Security Collaborative Research Alliance from 2013 to 2018. His research focuses on a wide range of topics in computer and network security and technical public policy. Prior to joining Penn State in 2004, he was a senior research staff member at AT&T Labs-Research.

12:45 pm – 1:00 pm: Discussion and Q&A

Day 2 (Thursday, August 26)

11:00 am – 11:05 am: Opening Remarks, Bo Li (University of Illinois at Urbana-Champaign)

11:05 am – 11:35 am: Toward Understanding End-to-End Learning in the Context of Data: Machine Learning Dancing over Semirings and Codd’s Table, Ce Zhang (ETH Zurich)

Abstract: Recent advances in machine learning (ML) systems have made it far easier to train ML models given a training set. However, our understanding of the behavior of the model training process has not improved at the same pace. Consequently, a number of key questions remain: How can we systematically assign importance or value to training data with respect to the utility of the trained models, be it accuracy, fairness, or robustness? How does noise in the training data, whether injected by noisy data acquisition processes or by adversarial parties, affect the trained models? How can we find the right data to clean and label in order to improve the utility of the trained models? Just as we have begun to understand these questions for ML models in isolation, we must face the reality that most real-world ML applications are far more complex than a single ML model. In this talk, I will revisit these questions for an end-to-end ML pipeline, which consists of a noise model for data and a feature extraction pipeline, followed by the training of an ML model. In the first part of this talk, I will introduce some recent theoretical results in an abstract way: How can we calculate the Shapley value of a training example for ML models trained over feature extractors, modeled as a polynomial in the provenance semiring? How can we compute the entropy and expectation of ML models trained over data uncertainty, modeled as a Codd table? As we will see, even these problems are #P-hard for general ML models, though, surprisingly, we can obtain PTIME algorithms for a simpler proxy model (namely a K-nearest neighbor classifier), for a large family of polynomials, input noise distributions, and utilities. I will then put these theoretical results into practice. Given a set of heuristics and a proxy model that reduce a real-world end-to-end ML pipeline to these abstract problems, I will provide a principled framework for three applications: (1) certifiable defense against backdoor attacks, (2) targeted data cleaning for ML, and (3) data valuation and debugging for end-to-end ML pipelines. I will describe both our positive empirical results and the cases where our current approach failed.
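
For readers unfamiliar with the Shapley value of a training example, the sketch below computes it directly from the definition for a tiny K-nearest-neighbor classifier. It enumerates every ordering of the training set, so it is exponential in the data size and is not the PTIME algorithm the abstract refers to. The utility function (the fraction of the k nearest neighbors whose label matches a single test point), the 1-D features, and the toy data are all assumptions made for illustration.

```python
from itertools import permutations


def knn_utility(subset, train, test_x, test_y, k):
    """Fraction of the k nearest neighbours (within subset) labelled test_y.

    An empty subset has utility 0.
    """
    nearest = sorted(subset, key=lambda i: abs(train[i][0] - test_x))[:k]
    return sum(train[i][1] == test_y for i in nearest) / k


def shapley_values(train, test_x, test_y, k):
    """Average marginal contribution of each training point over all orderings."""
    n = len(train)
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        chosen, prev = [], 0.0
        for i in order:
            chosen.append(i)
            cur = knn_utility(chosen, train, test_x, test_y, k)
            values[i] += cur - prev
            prev = cur
    return [v / len(orders) for v in values]


# Hypothetical training set of (feature, label) pairs and one test point.
train = [(0.1, 1), (0.4, 0), (0.9, 1), (1.5, 0)]
values = shapley_values(train, test_x=0.0, test_y=1, k=2)
```

The values sum to the utility of the full training set. The close, correctly labeled point at 0.1 receives a positive value, and the close, mislabeled point at 0.4 receives a negative one.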

Speaker: Ce Zhang is an Assistant Professor in Computer Science at ETH Zurich. He believes that by making data, along with the processing of data, easily accessible to non-expert users, we have the potential to make the world a better place. His current research focuses on understanding and building next-generation machine learning systems and platforms. Before joining ETH, Ce was advised by Christopher Ré, finished his PhD round-tripping between the University of Wisconsin-Madison and Stanford University, and spent another year as a postdoctoral researcher at Stanford. He contributed to the research efforts that won the SIGMOD Best Paper Award and the SIGMOD Research Highlight Award, and was featured in special issues, including in Science magazine, the Communications of the ACM, “Best of VLDB”, and Nature magazine. His work has also been reported by media such as the Atlantic, WIRED, and Quanta Magazine.

11:35 am – 12:05 pm: ML (Non-)Robustness and Dataset Biases, Aleksander Madry (Massachusetts Institute of Technology)

Abstract: Our current machine learning models achieve impressive performance on many benchmark tasks. Yet, these models remain remarkably brittle and susceptible to manipulation. Why is this the case? In this talk, we take a closer look at this question, and pinpoint some of the roots of this observed brittleness. Specifically, we discuss how the way current ML models “learn” and our existing datasets are created gives rise to widespread vulnerabilities, and then outline possible approaches to alleviate these deficiencies.

Speaker: Aleksander Madry is a Professor of Computer Science at MIT and leads the MIT Center for Deployable Machine Learning. His research interests span algorithms, continuous optimization, the science of deep learning, and understanding machine learning from robustness and deployability perspectives. Aleksander’s work has been recognized with a number of awards, including an NSF CAREER Award, an Alfred P. Sloan Research Fellowship, an ACM Doctoral Dissertation Award Honorable Mention, and a Presburger Award. He received his PhD from MIT in 2011 and, prior to joining the MIT faculty, he spent time at Microsoft Research New England and on the faculty of EPFL.

12:05 pm – 12:35 pm: What Does It Mean to “Verify” a Neural Network?, Zico Kolter (Carnegie Mellon University)

Abstract: In recent years, potential security threats arising from AI models have gained more prominence in research, industry, and government. The possibility of adversarial attacks, data poisoning, or other malicious threats to AI systems raises significant security concerns. In this setting, a number of techniques have emerged which can provide some level of security against these threats, including the notion of “verifying” the input/output properties of deep neural networks. In this talk, I'll briefly discuss some of the tools and techniques that can be used to provide these types of guarantees, including a collaboration that resulted in the alpha-beta-CROWN tool, a technique that recently ranked first in a broad competition on neural network verification. But more broadly, I will also discuss some of the (potentially intractable) challenges we face in building networks that provide the kinds of guarantees about correctness that we actually would want in practice.

Speaker: Zico Kolter is an Associate Professor in the Computer Science Department at Carnegie Mellon University, and also serves as Chief Scientist of AI Research for the Bosch Center for Artificial Intelligence. His work spans the intersection of machine learning and optimization, with a large focus on developing more robust and rigorous methods in deep learning. In addition, he has worked in a number of application areas, highlighted by work on sustainability and smart energy systems. He is a recipient of the DARPA Young Faculty Award, a Sloan Fellowship, and Best Paper awards at NeurIPS, ICML (honorable mention), IJCAI, KDD, and PESGM.

12:35 pm – 1:05 pm: Trustworthy Machine Learning: Past, Present, and Future, Somesh Jha (University of Wisconsin)

Abstract: Fueled by massive amounts of data, models produced by machine-learning (ML) algorithms, especially deep neural networks (DNNs), are being used in diverse domains where trustworthiness is a concern, including automotive systems, finance, healthcare, natural language processing, and malware detection. Of particular concern is the use of ML algorithms in cyber-physical systems (CPS), such as self-driving cars and aviation, where an adversary can cause serious consequences. Interest in this area of research has simply exploded. In this talk, we will cover the state of the art in trustworthy machine learning, and then cover some interesting future trends.

Speaker: Somesh Jha received his B.Tech in Electrical Engineering from the Indian Institute of Technology, New Delhi, and his Ph.D. in Computer Science from Carnegie Mellon University under the supervision of Prof. Edmund Clarke (a Turing Award winner). Currently, he is the Lubar Professor in the Computer Sciences Department at the University of Wisconsin (Madison). His work focuses on analysis of security protocols, survivability analysis, intrusion detection, formal methods for security, and analyzing malicious code. Recently, he has focused his interest on privacy and adversarial ML (AML). He has published several articles in highly refereed conferences and prominent journals and has won numerous best-paper and distinguished-paper awards. He is a recipient of the NSF CAREER Award and a Fellow of the ACM and IEEE.

1:05 pm – 2:00 pm: Discussion and Q&A

Day 3 (Friday, August 27)

11:00 am – 11:05 am: Opening Remarks, Sean Peisert (Lawrence Berkeley National Laboratory, University of California, Davis)

11:05 am – 11:35 am: Advancing Cybersecurity as an Evidence-Based Discipline, Stefan Savage (University of California, San Diego)

Abstract: Cybersecurity is a field that has generated a broad array of technologies, solutions, and best practices, each of which purports to improve the security of those who employ them. Sadly, we have a dearth of systematic evaluation supporting these claims and thus most security decisions are driven by a combination of anecdote, intuition, and received wisdom. In this talk, I will use several research case studies, both past and ongoing, to highlight the value of an “evidence-based” security approach, one in which empirical analytic data is used to guide security interventions and assess defenses.

Speaker: Stefan Savage is a professor of Computer Science and Engineering at the University of California, San Diego where he also serves as co-director of UCSD’s Center for Network Systems. He received his Ph.D. in Computer Science and Engineering from the University of Washington and a B.S. in Applied History from Carnegie Mellon University. Savage is a full-time empiricist, whose research interests lie at the intersection of computer security, distributed systems, and networking. Savage is a member of the American Academy of Arts & Sciences, a MacArthur Fellow, a Sloan Fellow, an ACM Fellow, and is a recipient of the ACM’s Prize in Computing and SIGOPS Weiser Award. He currently holds the Irwin and Joan Jacobs Chair in Information and Computer Science, but is a fairly down-to-earth guy and only writes about himself in the third person when asked.

11:35 am – 12:05 pm: Cryptography and the Democratizing Power of Learning Nothing, Mayank Varia (Boston University)

Abstract: In a data-driven society, cryptography offers the promise of making data science more accessible by allowing analysts to compute over data that they cannot see. In this talk, I will describe our work to design and deploy systems that compute empowering and equitable analytics while respecting personal privacy. I will also discuss how the machinery of cryptography can provide a new lens through which to view legal questions involving cybersecurity and government power. Overall, I hope to convince you that cryptography has great potential to benefit society but also introduces technology policy challenges of its own.
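
As a rough illustration of computing over data that no single party can see, the sketch below uses additive secret sharing, one standard building block of secure multi-party computation. The abstract does not say which protocols the deployed systems use, so this is an assumption-laden toy rather than a description of that work. The modulus `PRIME`, the three-server setup, and the salary figures are hypothetical, and the sketch assumes honest servers that do not all collude.

```python
import secrets

# Public modulus (a Mersenne prime); it must exceed the true sum.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split value into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; any proper subset alone looks uniformly random."""
    return sum(shares) % PRIME


# Hypothetical private inputs, one per contributor.
salaries = [52_000, 61_500, 48_250]
n_servers = 3

# Each contributor splits their salary and sends one share to each server.
all_shares = [share(s, n_servers) for s in salaries]

# Server j adds the j-th share from every contributor without learning any salary.
partial_sums = [sum(row[j] for row in all_shares) % PRIME for j in range(n_servers)]

# Only the combined partial sums reveal the aggregate.
total = reconstruct(partial_sums)
```

The analyst learns the total payroll but never sees an individual salary, because each server only ever holds random-looking shares.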

Speaker: Mayank Varia is a Research Associate Professor of Computer Science at Boston University. His research explores the computational and social aspects of cryptography, and his work has been featured in media outlets like CNET, The Hill, and ZDNet. His designs for accessible, equitable, and socially responsible data analysis have been used to determine the gender wage gap, measure subcontracting to minority-owned businesses, and identify repeat offenders of sexual assault, an effort inspired by the #MeToo movement. He serves on the United States Advisory Committee on Data for Evidence Building and the United Nations Privacy-Preserving Techniques Task Team to promote the use of cryptographically protected data analysis and shape the laws and policies surrounding its use. He received a PhD in mathematics from MIT in 2010.

12:05 pm – 12:35 pm: Detecting Adversarial Examples in Deep Learning, George Kesidis (Pennsylvania State University)

Abstract: Interest in adversarial deep learning has grown dramatically in the past few years. The existence of adversarial examples, which fool deep neural network classifiers but not human decision-makers, is not just a security threat. Perhaps even more importantly, it is a punctuated demonstration that deep neural networks do not yet achieve robust classification, let alone something akin to true artificial intelligence. One can attempt to incorporate adversarial examples into the training of deep neural networks. However, this type of “robust learning” is limited to only known methods of adversarial-example construction and is also prone to overfitting to the training set. Moreover, such an approach does not detect attacks – it merely enhances robust decision-making in the face of attacks. We discuss two state-of-the-art approaches to detection of adversarial examples for a given trained deep neural network (DNN). The first approach models internal-layer activations of the DNN based on clean examples and, using the resulting class-conditional null, employs a Kullback-Leibler criterion for detection of “low confidence” adversarial examples. The second approach employs a class-conditional Generative Adversarial Network (GAN) and achieves excellent results in detecting both low and high confidence adversarial examples. The internal layer activations that work best are those close to the input (i.e., “front end” features extracted from the DNN). The promising results for GAN-based detection raise the intriguing question of whether a GAN-based approach can replace conventional DNNs and achieve truly robust DNN decision-making.
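
One possible reading of the first approach can be sketched as follows: fit a class-conditional null model to clean internal-layer activations, then flag an input when the DNN's class posterior disagrees strongly, in the Kullback-Leibler sense, with the posterior implied by the null model. The 1-D activations, the Gaussian null with equal class priors, the threshold, and all numbers below are assumptions for illustration, not the speakers' actual detector.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_null(clean):
    """Fit a (mean, variance) Gaussian per class from clean activations."""
    null = {}
    for label, acts in clean.items():
        mean = sum(acts) / len(acts)
        var = sum((a - mean) ** 2 for a in acts) / len(acts)
        null[label] = (mean, var)
    return null


def null_posterior(activation, null):
    """Class posterior implied by the null model, assuming equal class priors."""
    dens = {c: gaussian_pdf(activation, m, v) for c, (m, v) in null.items()}
    total = sum(dens.values())
    return {c: d / total for c, d in dens.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared set of class labels; eps guards against log(0)."""
    return sum(p[c] * math.log((p[c] + eps) / (q[c] + eps)) for c in p if p[c] > 0)


def is_suspicious(activation, dnn_posterior, null, threshold=1.0):
    return kl_divergence(dnn_posterior, null_posterior(activation, null)) > threshold


# Hypothetical clean activations for two classes.
null = fit_null({0: [0.5, 1.0, 1.5], 1: [2.5, 3.0, 3.5]})

# The activation looks like class 0. A DNN output that agrees passes the check,
# while a DNN output that confidently says class 1 is flagged.
consistent = is_suspicious(1.2, {0: 0.99, 1: 0.01}, null)
mismatched = is_suspicious(1.2, {0: 0.05, 1: 0.95}, null)
```

In this toy setup the consistent case is not flagged and the mismatched case is.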

Speaker: George Kesidis received his MS (in 1990) and PhD (in 1992) in Electrical Engineering and Computer Sciences from the University of California, Berkeley. Following eight years as a professor of Electrical and Computer Engineering at the University of Waterloo, he has been a professor of Computer Science and Engineering and Electrical Engineering at the Pennsylvania State University since 2000. His research interests include many aspects of networking and cyber security, including the impact of economic policy and applications of machine learning. His work has been supported by over a dozen NSF research grants, several Cisco Systems URP gifts (lately for machine learning applications to cyber security), and recent DARPA and AFOSR grants.

12:35 pm – 12:55 pm: Discussion and Q&A

12:55 pm – 1:00 pm: Closing Remarks