Our NIPS'16 Spotlight Video is out.
The website for the course CS565500 Large-Scale Machine Learning is now online.
My lab is recruiting graduate students who have a passion for:
Prospective students are welcome to have a discussion with me to better understand our topics and projects. Please send me an email with your resume (including your transcript and past projects) first to arrange the meeting time.
I am currently an associate professor at the Department of Computer Science, National Tsing Hua University (NTHU), Taiwan. Since 2016, I have also served as Division Director for the Division of Academic Information System, Computer & Communication Center of NTHU. My research interests include:
I received my Ph.D. degree in Electrical Engineering from National Taiwan University, Taiwan (2005/09 - 2009/02). Before joining NTHU in 2010, I was a senior research scientist at Telcordia Technologies Inc. (formerly Bellcore) between 2004 and 2010.
I am also a programmer. At my leisure, I write some interesting software.
See DBLP for the boring list of my publications.
Ting-Yu Cheng, Kuan-Hua Lin, Xinyang Gong, Kang-Jun Liu, and Shan-Hung Wu, "Learning User Perceived Clusters with Feature-Level Supervision," in Advances in Neural Information Processing Systems (NIPS), December 2016
* NIPS is a top conference in the field of Machine Learning.
Semi-supervised clustering algorithms have been proposed to identify data clusters that align with user-perceived ones with the aid of side information such as seeds or pairwise constraints. However, traditional side information is mostly at the instance level and subject to sampling bias, where non-randomly sampled instances in the supervision can mislead the algorithms to wrong clusters. In this paper, we propose learning from feature-level supervision. We show that this kind of supervision can be easily obtained in the form of perception vectors in many applications. Then we present novel algorithms, called Perception Embedded (PE) clustering, that exploit the perception vectors as well as traditional side information to find clusters perceived by the user. Extensive experiments are conducted on real datasets and the results demonstrate the effectiveness of PE empirically.
Yan-Fu Liu, Cheng-Yu Hsu, and Shan-Hung Wu, "Non-Linear Cross-Domain Collaborative Filtering via Hyper-Structure Transfer," in Proc. of the 32nd Int'l Conf. on Machine Learning (ICML), July 2015
* ICML is a top conference in the field of Machine Learning.
Cross-Domain Collaborative Filtering (CDCF) exploits the rating matrices from multiple domains to make better recommendations. Existing CDCF methods adopt the sub-structure sharing technique, which can only transfer linearly correlated knowledge between domains. In this paper, we propose the notion of Hyper-Structure Transfer (HST), which requires the rating matrices to be explained by the projections of some more complex structure, called the hyper-structure, shared by all domains, and thus allows non-linearly correlated knowledge between domains to be identified and transferred. Extensive experiments are conducted and the results demonstrate the effectiveness of our HST models empirically.
Shan-Hung Wu, Hao-Heng Chien, Kuan-Hua Lin, and Philip S. Yu, "Learning the Consistent Behavior of Common Users for Target Node Prediction across Social Networks," in Proc. of the 31st Int'l Conf. on Machine Learning (ICML), June 2014
In this work, we study the target node prediction problem: given two social networks, identify those nodes/users from one network (called the source network) who are likely to join another (called the target network, with nodes called target nodes). Although this problem can be solved using existing techniques in the field of cross-domain classification, we observe that in many real-world situations the cross-domain classifiers perform sub-optimally due to the heterogeneity between source and target networks, which prevents the knowledge from being transferred. In this paper, we propose learning the consistent behavior of common users to help the knowledge transfer. We first present the Consistent Incidence Co-Factorization (CICF) for identifying the consistent users, i.e., common users that behave consistently across networks. Then we introduce the Domain-UnBiased (DUB) classifiers that transfer knowledge only through those consistent users. Extensive experiments are conducted and the results show that our proposal copes with heterogeneity and improves prediction accuracy.
Shan-Hung Wu, Tsai-Yu Feng, Meng-Kai Liao, Shao-Kan Pi, and Yu-Shan Lin, "T-Part: Partitioning of Transactions for Forward-Pushing in Deterministic Database Systems," in Proc. of the 2016 ACM Int'l Conf. on Management of Data (SIGMOD), June 2016
* ACM SIGMOD is a top conference in the field of Database Systems and Big Data Management.
Deterministic database systems, a type of NewSQL database systems, have been shown to yield high throughput on a cluster of commodity machines while ensuring the strong consistency between replicas, provided that the data can be well-partitioned on these machines. However, data partitioning can be suboptimal for many reasons in real-world applications. In this paper, we present T-Part, a transaction execution engine that partitions transactions in a deterministic database system to deal with the unforeseeable workloads or workloads whose data are hard to partition. By modeling the dependency between transactions as a T-graph and continuously partitioning that graph, T-Part allows each transaction to know which later transactions on other machines will read its writes so that it can push forward the writes to those later transactions immediately after committing. This forward-pushing reduces the chance that the later transactions stall due to the unavailability of remote data. We implement a prototype for T-Part. Extensive experiments are conducted and the results demonstrate the effectiveness of T-Part.
Shan-Hung Wu, Man-Ju Chou, Chun-Hsiung Tseng, Yuh-Jye Lee, Kuan-Ta Chen, "Detecting In-Situ Identity Fraud on Social Network Services: A Case Study on Facebook," in IEEE Systems Journal, to appear
With the growing popularity of Social Networking Services (SNSs), increasing amounts of sensitive information are stored online and linked to SNS accounts. The obvious value of SNS accounts gives rise to the identity fraud problem--unauthorized, stealthy use of SNS accounts. For example, anxious parents may use their children's SNS accounts to spy on the children's social interaction; or husbands/wives may check their spouses' SNS accounts if they suspect infidelity. Stealthy identity fraud could happen to anyone and seriously invade the privacy of account owners. However, there is no known defense against such behavior when an attacker, possibly an acquaintance of the victim, gets access to the victim's computing devices. In this paper, we propose to extend the use of continuous authentication to detect in-situ identity fraud incidents, which occur when the attackers use the same accounts, the same devices, and the same IP addresses as the victims. Using Facebook as a case study, we show that it is possible to detect such incidents by analyzing SNS users' browsing behavior. Our experimental results demonstrate that the approach can achieve higher than 80% detection accuracy within 2 minutes, and over 90% after 7 minutes of observation time.
Shan-Hung Wu, Ching-Chan Wu, Wing-Kai Hon, and Kang G. Shin, "Rendezvous for Heterogeneous Spectrum-Agile Devices," in Proc. of the 33rd Int'l Conf. on Computer Communications (INFOCOM), April 2014
* IEEE INFOCOM is a top conference in the field of Networking and Applications.
Cognitive radio (CR) is intended to meet the exponentially growing demand for spectrum by allowing for opportunistic utilization of idle legacy channels. Rendezvous, where two radios complete handshaking in an idle channel, is a key step for “stranger” (unknown to each other) CRs to start communication. However, none of the existing algorithms guarantees rendezvous for heterogeneous or stranger CRs with different spectrum-sensing capabilities, in spite of the fact that (i) a wide variety of mobile devices are equipped with heterogeneous radios and (ii) there are numerous applications requiring efficient rendezvous for heterogeneous radios/CRs. In this paper, we propose a new channel hopping algorithm, called Heterogeneous Hopping (HH), that guarantees rendezvous without assuming the existence of a universal channel set that can be sensed by all radios. HH is realized with a two-layer design that harmonizes the fixed-short-cycle and parity-alignment techniques we propose here, in order to guide CRs to rendezvous in two complementary situations resulting from the different capabilities of mobile wireless devices. To the best of our knowledge, HH is the first channel-hopping scheme that guarantees rendezvous between heterogeneous radios. Our in-depth evaluation has shown HH to be significantly faster than simple extensions of existing schemes. Moreover, the latter cannot guarantee successful rendezvous, either.
"Big Data behind Small Apps," in AI Forum 2016, AEARU-CSWT 2016
The marketplace for mobile apps is very crowded and competitive today. Many developers take the "start from small" strategy to gradually acquire early adopters from a niche and find the product-market fit. However, this strategy leads to a general misconception that gaining popularity requires the founders to have profound experience in marketing and/or resources.
In this talk, I share my personal experience in running some apps using a scientific approach based on the data collected by the apps. Technically, we found that some of the data analysis tasks are more challenging than the well-known "V's" barriers of big data analytics. Commercially, we show that innovative execution/marketing strategies aided by specialized machine learning techniques are key to driving an unfair business advantage.
"Asymmetric Support Vector Machines: A Tutorial," in AI Forum 2010
This is a tutorial on my work published in KDD'08. Many practical applications of classification require the classifier to produce a very low false-positive rate. Although the Support Vector Machine (SVM) has been widely applied to these applications due to its superiority in handling high-dimensional data, there has been relatively little effort, other than setting a threshold or changing the costs of slacks, to ensure a low false-positive rate. In this paper, we propose the notion of the Asymmetric Support Vector Machine (ASVM), which takes into account the false-positives and the user tolerance in its objective. This new objective formulation allows us to raise the confidence in predicting the positives, and therefore obtain a lower chance of false-positives. We study the effects of the parameters in the ASVM objective and address some implementation issues related to the Sequential Minimal Optimization (SMO) to cope with large-scale data. An extensive simulation is conducted and shows that ASVM is able to yield either noticeable improvement in performance or reduction in training time as compared to prior art.
Teaching is my first priority. I spend a lot of time preparing the course materials and hope every student can get the most out of the topics covered. Therefore, most of my courses carry heavy workloads. Please make sure you reserve enough time before taking them. If you have any questions or suggestions about my courses, please feel free to send me an email.
This course presents hands-on labs for students to become familiar with the software development process, techniques, and tools. Students are asked to build real, useful applications (websites and/or mobile apps) accessible to the public.
This course provides an overview of the current database management systems in the cloud, and explains how they are different from traditional database systems. The goal is to get students familiar with some well-known implementations like NoSQL databases, Google BigTable, Google MegaStore, and Google Spanner, and more importantly, to help students make better decisions on the design tradeoffs when configuring/building their own database systems with a particular set of target applications (tenants) in mind.
Large-Scale Machine Learning
This class provides a practical guide for students to perform large-scale data analysis with open-source tools. We bring machine learning theory, tools, and real-world datasets together to teach students how to analyze massive data effectively and efficiently.
This course is divided into 3 parts. In the first part, we review some of the maths required by machine learning. In the second part, we introduce fundamental machine learning concepts/models/algorithms. Lastly, in the third part, we discuss large-scale machine learning for big data and how it differs from small-scale learning tasks. In particular, we will focus on artificial neural networks and deep learning techniques and explain why they have recently become popular (again).
This course emphasizes both theory and coding. It is intended for senior undergraduate and graduate students who have a solid understanding of computer programming, probability, calculus, and linear algebra. In particular, we will use Python as the main programming language throughout the course. Although helpful, background knowledge of large-scale machine learning and deep learning tools/libraries such as Spark and Theano is not necessary.
Machine Learning (Advanced)
This is an advanced course on machine learning. We will introduce deep generative models and more engineering aspects of machine learning. In addition, we will cover reinforcement learning and how it can leverage game theory and deep learning to interact with environments better.
A detailed syllabus will be announced later.
Past: Spring 2011-2012, Fall 2013-2015
App Entrepreneurship and Implementation
Past: Spring 2013-2014
To members: please join this group first.
I am very lucky to be able to work with many smart and creative students in DataLab. Our lab is located at Delta 723 and 724. People are nice there (although busy sometimes). Please feel free to stop by and chat with us to learn more about our projects.
I like to turn ideas into real things. At my leisure, I write code with my students to transform our research results into fun and easy-to-use software that benefits the general audience:
ElaSQL is a distributed relational database system prototype that aims to offer high scalability, high availability, and elasticity to the on-line transaction processing (OLTP) applications.
VanillaDB is a collection of simple-to-read, fast, and extensible database system components aiming to lower the barrier of new-system prototyping and/or learning the database internals.
Hop is an app that lets you share fun moments without a care and get your biggest fans. It's like group Snapchat, but better!
Flora is an app that helps you stay focused on a task with your friends. It's a companion to the Forest app initiated by former DataLab member Shao-Kan Pi.