Observations consisting of measurements on pairs of objects (or conditions) arise in a number of settings: in the biological sciences (www.yeastgenome.org), in collections of scientific publications (www.jstor.org) and other hyperlinked resources (www.wikipedia.org), and in social networks (www.linkedin.com). Analyses of such data typically aim at identifying structure among the units of interest in a low-dimensional space, to support the generation of substantive hypotheses, to partially automate semantic categorization, to facilitate browsing, and, more generally, to simplify complex data into useful patterns. In this lecture, we will survey a few exchangeable graph models. We will then focus on the stochastic blockmodel and show its utility as a quantitative tool for exploring static and dynamic networks. Within this modeling context, we discuss alternative specifications and extensions that address fundamental issues in the data analysis of complex interacting systems: bridging global and local phenomena, data integration, dynamics, and scalable inference.
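As a minimal generative sketch of the stochastic blockmodel mentioned above: each node receives a block label, and the probability of an edge depends only on the pair of blocks. The block sizes and the block-probability matrix below are illustrative choices, not values from the lecture.

```python
import numpy as np

def sample_sbm(labels, B, rng):
    """Sample a symmetric adjacency matrix from a stochastic blockmodel."""
    n = len(labels)
    P = B[np.ix_(labels, labels)]           # pairwise edge probabilities
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                       # undirected, no self-loops
    return A + A.T

rng = np.random.default_rng(1)
labels = np.array([0] * 5 + [1] * 5)        # two blocks of five nodes (illustrative)
B = np.array([[0.8, 0.05],
              [0.05, 0.8]])                 # assortative structure: dense within blocks
A = sample_sbm(labels, B, rng)
print(A.shape)
```

Fitting the model reverses this process: given an observed adjacency matrix, one infers the block labels and the matrix B, which is what makes the blockmodel useful as an exploratory tool.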
Bio: Edo M. Airoldi joined the faculty of Harvard University in January 2009 as an Assistant Professor of Statistics. He is also a member of the Center for Systems Biology in the Faculty of Arts and Sciences, and a faculty member in the Systems Biology and Quantitative Genomics Ph.D. programs. Before joining Harvard, he was a postdoctoral fellow at Princeton University in the Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics. He received a bachelor's degree in mathematical statistics from Bocconi University and a Ph.D. in computer science from Carnegie Mellon University, for which he was awarded the Leonard J. Savage Award honorable mention from the International Society for Bayesian Analysis in 2007. His research interests include statistical methodology and theory with applications to molecular biology and integrative genomics, computational social science, and the statistical analysis of large biological and information networks. He is a Big Think Delphi Fellow (2011-), and received the John Van Ryzin Award (2006) from the International Biometric Society.
Virtually all methods of learning dynamic systems from data start with the same basic assumption: the learning algorithm will be given a time sequence of data generated from the dynamic system. We consider the case where the training data come from the system's operation but with no temporal ordering. The data are simply drawn as individual, disconnected points. While this assumption may seem absurd at first glance, we observe that many scientific modeling tasks have exactly this property. We propose several methods for solving this problem. We write down an approximate likelihood function that may be optimized to learn dynamic models and show how kernel methods can be used to obtain non-linear models. We propose an alternative method that focuses on achieving temporal smoothness in the learned dynamics. Finally, we consider the case where a small amount of sequenced data is available along with a large amount of non-sequenced data. We propose the use of the Lyapunov equation and the non-sequenced data to provide regularization when performing regression on the sequenced data to learn a dynamic model. We demonstrate our methods on synthetic data and describe the results of our analysis of some bioinformatics data sets.
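To see why the Lyapunov equation connects non-sequenced data to the dynamics, consider a stable linear system x_{t+1} = A x_t + w_t with noise covariance Q: its stationary covariance S satisfies the discrete Lyapunov equation S = A S Aᵀ + Q. Non-sequenced draws from the stationary distribution estimate S, which constrains A even without temporal ordering. The matrices below are illustrative, not from the talk's experiments.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])   # an assumed stable dynamics matrix (spectral radius < 1)
Q = 0.1 * np.eye(2)          # process-noise covariance

# Stationary covariance: the unique S solving S = A S A^T + Q.
S = solve_discrete_lyapunov(A, Q)

# The fixed-point property that an empirical covariance estimate can
# enforce as a regularizer when regressing for A:
residual = np.linalg.norm(S - A @ S @ A.T - Q)
print(residual)  # ~0
```

In the regularized-regression setting, one would replace S by the sample covariance of the non-sequenced points and penalize this residual while fitting A to the small sequenced set.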
Topic models are learned via a statistical model of variation within document collections, but are designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We first describe a doubly correlated nonparametric topic (DCNT) model which captures all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. This structure can be explored via an intuitive graphical interface. While basic inference algorithms such as the Gibbs sampler are easily applied to hierarchical nonparametric Bayesian models, including the DCNT, in practice they can fail in subtle and hard-to-diagnose ways. We explore this issue via a simpler nonparametric topic model, the hierarchical Dirichlet process (HDP). Using a carefully crafted search algorithm which finds likely configurations under a marginalized representation of the HDP, we demonstrate the poor mixing behavior of conventional HDP sampling algorithms. In the process, we illustrate unrealistic biases of the "toy" datasets commonly used to validate inference algorithms. Applied to a collection of scientific documents, our search algorithm also reveals statistical features which are poorly modeled via the HDP.
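The stick-breaking construction mentioned above can be sketched in a few lines: topic weights are formed as beta_k = v_k · Π_{j<k}(1 − v_j) with v_k ~ Beta(1, α), so an unbounded sequence of topics receives geometrically vanishing mass. The concentration α and truncation level K below are illustration choices, not the DCNT's settings.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking (GEM) weights: break a unit-length stick
    K times, keeping fraction v_k of whatever remains at each step."""
    v = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

rng = np.random.default_rng(42)
weights = stick_breaking(alpha=2.0, K=50, rng=rng)
print(weights.sum())  # sum approaches 1 as K grows; the leftover is unbroken stick
```

Because the weights are generated sequentially rather than fixed in number, the model can allocate new topics as the data demand, which is what removes the need to specify the number of topics in advance.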
In this talk I will give an overview of an array of highly scalable techniques for both observed and latent variable models. This makes them well suited for problems such as classification, recommendation systems, topic modeling, and user profiling. I will present algorithms for batch and online distributed convex optimization to deal with large amounts of data, and hashing to address the issue of parameter storage for personalization and collaborative filtering. Furthermore, to deal with latent variable models I will discuss distributed sampling algorithms capable of dealing with tens of billions of latent variables on a cluster of 1000 machines. The algorithms described are used for personalization, spam filtering, recommendation, document analysis, and advertising.
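The hashing approach to parameter storage can be sketched as follows: features are mapped into a fixed-size weight vector by a hash function, so the model never stores a feature dictionary, and memory is bounded regardless of vocabulary size. The function names, hash choice, and bucket count here are illustrative, not the production system's.

```python
import zlib

def hashed_features(tokens, n_buckets=2**20):
    """Map string features into a fixed number of buckets (the 'hashing
    trick'). A sign hash makes collisions cancel in expectation."""
    vec = {}
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % n_buckets
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec

print(hashed_features(["user:123", "item:abc", "user:123"]))
```

A learner then keeps a single dense weight array of length n_buckets; personalization features for millions of users hash into the same bounded store, trading a small, controllable collision rate for constant memory.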
Bio: Bee-Chung Chen is currently a research scientist at Yahoo! Research. His research interests include recommender systems, large-scale data analysis, and scalable methods for modeling and mining. He received his Ph.D. from the University of Wisconsin-Madison with an outstanding graduate student research award from the Department of Computer Sciences. His work on "cube-space data mining" was recognized with the ACM SIGMOD Doctoral Dissertation Award Honorable Mention. His recent work on "explore/exploit schemes for Web content optimization" won the ICDM 2009 best paper award. He is a key designer of the recommendation algorithms that power a number of major Yahoo! sites.
Bio: Nanda Kambhatla has nearly two decades of research experience in the areas of Natural Language Processing (NLP), text mining, information extraction, dialog systems, and machine learning. He holds 7 U.S. patents and has authored over 40 publications in books, journals, and conferences in these areas. Nanda holds a B.Tech in Computer Science and Engineering from the Institute of Technology, Benaras Hindu University, India, and a Ph.D. in Computer Science and Engineering from the Oregon Graduate Institute of Science & Technology, Oregon, USA. Currently, Nanda is the senior manager of the Human Language Technologies department at IBM Research - India, Bangalore. He leads a group of over 20 researchers focused on research in the areas of NLP, advanced text analytics (IE, IR, sentiment mining, etc.), speech analytics, and statistical machine translation. Most recently, Nanda was the manager of the Statistical Text Analytics Group at IBM's T.J. Watson Research Center, the Watson co-chair of the Natural Language Processing PIC, and the task PI for the Language Exploitation Environment (LEE) subtask for the DARPA GALE project. He has been leading the development of information extraction tools/products, and his team has achieved top-tier results in successive Automatic Content Extraction (ACE) evaluations conducted by NIST for extracting entities, events, and relations from text from multiple sources, in multiple languages and genres. Earlier in his career, Nanda worked on natural language web-based and spoken dialog systems at IBM. Before joining IBM, he worked on information retrieval and filtering algorithms as a senior research scientist at WiseWire Corporation, Pittsburgh, and on image compression algorithms as a postdoctoral fellow under Prof. Simon Haykin at McMaster University, Canada.
Nanda's research interests are focused on NLP and technology solutions for creating, storing, searching, and processing large volumes of unstructured data (text, audio, video, etc.), and specifically on applications of statistical learning algorithms to these tasks.
In this talk, I will begin with an overview of statistical challenges that arise in recommender system problems for web applications such as content optimization and online advertising. I will then describe some modeling solutions for a content optimization problem that arises in the context of the Yahoo! Front Page. In particular, I will discuss time series models to track item popularity, explore/exploit and sequential design schemes to enhance performance, and matrix factorization models to personalize content to users. For some of the methods, I will present experimental results from an actual system at Yahoo!. I will also provide examples of other applications where the techniques are useful, and end with a discussion of some open problems in the area.
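The explore/exploit tension above can be illustrated with the simplest such scheme, epsilon-greedy: with small probability show a random item to gather data, otherwise show the item with the best estimated click-through rate. The item names, counts, and epsilon value are illustrative only; production systems use more refined sequential designs.

```python
import random

def choose_item(clicks, views, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore a random item with probability
    epsilon, otherwise exploit the highest observed click-through rate."""
    items = list(views)
    if rng.random() < epsilon:
        return rng.choice(items)  # explore: gather data on any item
    # exploit: pick the item with the best empirical CTR so far
    return max(items, key=lambda i: clicks[i] / max(views[i], 1))

clicks = {"story_a": 30, "story_b": 12}   # hypothetical click counts
views = {"story_a": 1000, "story_b": 200} # hypothetical impression counts
print(choose_item(clicks, views))
```

Pure exploitation would lock onto whichever item happened to look best early; the exploration fraction keeps CTR estimates for all items fresh, which is the core trade-off the sequential design schemes in the talk manage more efficiently.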