Weak supervision


Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide a supervision signal for labeling large amounts of training data in a supervised learning setting.[1] This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect but can nonetheless be used to create a strong predictive model.[2][3][4]

Problem of labeled training data

Machine learning models and techniques are increasingly accessible to researchers and developers; the real-world usefulness of these models, however, depends on access to high-quality labeled training data.[5] This need for labeled training data often proves to be a significant obstacle to the application of machine learning models within an organization or industry.[1] This bottleneck effect manifests itself in various ways, including the following examples:

Insufficient quantity of labeled data

When machine learning techniques are initially used in new applications or industries, there is often not enough training data available to apply traditional processes.[6] Some industries have the benefit of decades' worth of training data readily available; those that do not are at a significant disadvantage. In such cases, obtaining training data may be impractical, expensive, or impossible without waiting years for its accumulation.

Insufficient subject-matter expertise to label data

When labeling training data requires specific relevant expertise, creation of a usable training data set can quickly become prohibitively expensive.[6] This issue is likely to occur, for example, in biomedical or security-related applications of machine learning.

Insufficient time to label and prepare data

Most of the time required to implement machine learning is spent in preparing data sets.[6] When an industry or research field deals with problems that are, by nature, rapidly evolving, it can be impossible to collect and prepare data quickly enough for results to be useful in real-world applications. This issue could occur, for example, in fraud detection or cybersecurity applications.

Other areas of machine learning exist that are likewise motivated by the demand for increased quantity and quality of labeled training data but employ different high-level techniques to approach this demand. These other approaches include active learning, semi-supervised learning, and transfer learning.[1]

Types of weak labels

Weak labels are intended to decrease the cost and increase the efficiency of human efforts expended in hand-labeling data. They can take many forms, and might be categorized into three types:

  • Global statistics on groups of inputs: This setting consists of having access to global information about bags of samples, e.g. knowing that half of the samples in a given subset carry a particular label. Examples of supervision through global statistics include multiple-instance learning[7] and learning from label proportions.[8]
  • Weak classifiers: A second approach assumes access to many weak classifiers that are weakly correlated with the function to be learned. These classifiers might model labelers from a crowdsourcing platform, experts, noisy measurements, or heuristic rules. More generally, developers may take advantage of existing resources (such as knowledge bases, alternative data sets, or pre-trained models[1]) to create labels that are helpful, though not perfectly suited for the given task.[9]
  • Incomplete annotation: Finally, weak supervision can be understood as having access to partial knowledge about each label. This partial knowledge can be thought of as the result of a corruption process.[10] In some instances, a partial observation can be cast as a set of potential labels that are compatible with it, which is the setting of partial supervision.[11][12] Partial supervision is a generalization of semi-supervised learning, which has been the classical approach to overcoming the data annotation bottleneck.

Beyond these three settings, the limitations that motivate weakly supervised learning might be tackled by leveraging human knowledge in the form of priors[13] or of function architectures, reviving older approaches to artificial intelligence such as inductive logic programming.
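
The three kinds of weak label described in this section can be pictured as different constraints on the unknown true labels. The following is a minimal Python sketch of that idea, not a standard API: the class names (ProportionBag, WeakVotes, CandidateSets), the sample identifiers, and the toy values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ProportionBag:
    """Global statistic: only the fraction of positive samples in a bag is known."""
    sample_ids: List[str]
    positive_fraction: float

    def is_consistent(self, labels: Dict[str, int]) -> bool:
        # Compare the positive rate of a hypothesized labeling with the known one.
        positives = sum(labels[s] for s in self.sample_ids)
        return abs(positives / len(self.sample_ids) - self.positive_fraction) < 1e-9


@dataclass
class WeakVotes:
    """Weak classifiers: several noisy votes per sample, none fully trusted."""
    votes: Dict[str, List[int]]

    def agreement(self, labels: Dict[str, int]) -> float:
        # Fraction of all votes that agree with a hypothesized labeling.
        total = sum(len(v) for v in self.votes.values())
        hits = sum(vote == labels[s] for s, v in self.votes.items() for vote in v)
        return hits / total


@dataclass
class CandidateSets:
    """Incomplete annotation: each sample comes with a set of possible labels."""
    candidates: Dict[str, Set[int]]

    def is_consistent(self, labels: Dict[str, int]) -> bool:
        # Every hypothesized label must lie inside its candidate set.
        return all(labels[s] in c for s, c in self.candidates.items())


# A hypothesized labeling of four samples (1 = positive, 0 = negative).
hypothesis = {"a": 1, "b": 0, "c": 1, "d": 0}

bag = ProportionBag(["a", "b", "c", "d"], positive_fraction=0.5)
votes = WeakVotes({"a": [1, 1, 0], "b": [0, 0, 1]})
partial = CandidateSets({"c": {1}, "d": {0, 1}})

bag_ok = bag.is_consistent(hypothesis)          # True: 2 of 4 samples are positive
vote_agreement = votes.agreement(hypothesis)    # 4 of 6 votes agree
partial_ok = partial.is_consistent(hypothesis)  # True: both labels are candidates
```

In each case the weak signal narrows down which labelings are plausible without fixing a single ground-truth label per sample.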

Applications of weak supervision

Applications of weak supervision are numerous and varied within the machine learning research community.

In 2014, researchers from UC Berkeley used the principles of weak supervision to propose an iterative learning algorithm that depends solely on labels generated by heuristics, removing the need to collect any ground-truth labels.[14][15] The algorithm was applied to smart meter data to infer household occupancy without ever requesting occupancy data, which raised privacy and security concerns, as covered in an article in IEEE Spectrum.[16]

In 2018, researchers from UC Riverside proposed a method to localize actions and events in videos using only weak supervision, i.e., video-level labels, with no information about the start and end times of the events during training. Their work[17] introduced an attention-based similarity between two videos, which acts as a regularizer for learning with weak labels. In 2019, they introduced a new problem,[18] event localization in videos from user text queries, again with only weak annotations during training. Later, in collaboration with NEC Laboratories America, a similar attention-based alignment mechanism with weak labels was introduced for adapting a source semantic segmentation model to a target domain.[19] When the weak labels of the target images are estimated using the source model, the setting is unsupervised domain adaptation and incurs no target annotation cost. When the weak labels are acquired from an annotator, the setting incurs a very small annotation cost and falls under weakly supervised domain adaptation, which this work was the first to introduce for semantic segmentation.

Stanford University researchers created Snorkel, an open-source system for quickly assembling training data through weak supervision.[20] Snorkel employs the central principles of the data programming paradigm,[9] in which developers write labeling functions that are used to label data programmatically, and it uses a generative model to estimate the accuracy of those labeling functions without ground-truth labels.[21] In this way, potentially low-quality inputs can be used to create high-quality models. The Stanford AI Lab researchers later founded Snorkel AI, a company that grew out of the Snorkel project and applies programmatic data labeling and weak supervision with the aim of reducing AI development cost and time.[22]
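
The idea of labeling functions can be illustrated with a short sketch in plain Python. This is not Snorkel's API: the function names, the spam/ham task, the keyword rules, and the example texts are hypothetical, and a simple majority vote stands in for the model that Snorkel uses to weigh labeling functions against one another.

```python
from collections import Counter
from typing import Callable, List

# Label values; ABSTAIN means a labeling function has no opinion on an input.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Heuristic: messages containing a URL are likely spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: messages mentioning a prize are likely spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_reply(text: str) -> int:
    """Heuristic: very short messages are likely ordinary replies."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_link,
    lf_mentions_prize,
    lf_short_reply,
]


def apply_lfs(texts: List[str]) -> List[List[int]]:
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine non-abstaining votes; abstain when there are none or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return ABSTAIN
    return top


texts = [
    "Claim your prize at http://example.com now",
    "See you tomorrow",
    "Thanks, got the prize",
    "The quarterly report is attached for your review",
]
label_matrix = apply_lfs(texts)
weak_labels = [majority_vote(row) for row in label_matrix]  # [1, 0, -1, -1]
```

The third text receives conflicting votes and the fourth receives none, so both are left unlabeled. Resolving such conflicts by estimating how reliable each labeling function is, rather than counting votes equally, is the part that data programming adds.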

In a joint work with Google, Stanford researchers showed that existing organizational knowledge resources could be converted into weak supervision sources and used to significantly decrease development costs and time.[23]

In 2019, Massachusetts Institute of Technology and Google researchers released cleanlab, the first standardized Python package for machine learning and deep learning with noisy labels.[24] Cleanlab implements confident learning,[25][26] a framework of theory and algorithms for dealing with uncertainty in dataset labels, to (1) find label errors in datasets, (2) characterize label noise, and (3) standardize and simplify research in weak supervision and learning with noisy labels.[27]
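
The core step of confident learning, finding likely label errors, can be sketched in a few lines of plain Python. This is a simplified illustration and not cleanlab's implementation or API: the function name find_label_issues_sketch and the toy data are hypothetical, the predicted probabilities are assumed to come from a model evaluated out of sample (for example by cross-validation), and every class is assumed to occur at least once among the given labels.

```python
from typing import List


def find_label_issues_sketch(labels: List[int], probs: List[List[float]]) -> List[int]:
    """Return indices of examples whose given label looks wrong.

    labels[i] is the (possibly noisy) given label of example i and
    probs[i][k] is the model's predicted probability that example i is class k.
    """
    n_classes = len(probs[0])

    # Per-class threshold: the average predicted probability of class k
    # over the examples that were given label k.
    thresholds = []
    for k in range(n_classes):
        self_conf = [p[k] for y, p in zip(labels, probs) if y == k]
        thresholds.append(sum(self_conf) / len(self_conf))

    issues = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Classes the model predicts at or above their own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append(i)  # confidently predicted class disagrees with the label
    return issues


# Toy binary data: example 2 is labeled 0 but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspects = find_label_issues_sketch(labels, probs)  # [2]
```

Using a separate threshold for each class, instead of a single global cutoff, keeps the procedure from being skewed toward classes on which the model is systematically more or less confident.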

Researchers at the University of Massachusetts Amherst proposed augmenting traditional active learning approaches by soliciting labels on features rather than on instances within a data set.[28]

Researchers at Johns Hopkins University proposed reducing the cost of labeling data sets by having annotators provide rationales supporting each of their data annotations, then using those rationales to train both discriminative and generative models for labeling additional data.[29]

Researchers at the University of Alberta proposed a method that applies traditional active learning approaches to enhance the quality of the imperfect labels provided by weak supervision.[30]

References

  1. ^ a b c d Ratner, Alex; Bach, Stephen; Varma, Paroma; Ré, Chris (referencing work by many other members of Hazy Research). "Weak Supervision: A New Programming Paradigm for Machine Learning". The Stanford AI Lab Blog. Retrieved 2022-03-22.
  2. ^ Campagner, Andrea; Ciucci, Davide; Svensson, Carl Magnus; Figge, Marc Thilo; Cabitza, Federico (2021). "Ground truthing from multi-rater labeling with three-way decision and possibility theory". Information Sciences. 545: 771–790. doi:10.1016/j.ins.2020.09.049. S2CID 225116425.
  3. ^ Zhou, Zhi-Hua (2018). "A Brief Introduction to Weakly Supervised Learning" (PDF). National Science Review. 5: 44–53. doi:10.1093/NSR/NWX106. S2CID 44192968. Archived from the original (PDF) on 22 February 2019. Retrieved 4 June 2019.
  4. ^ Nodet, Pierre; Lemaire, Vincent; Bondu, Alexis; Cornuéjols, Antoine; Ouorou, Adam (2021). "From Weakly Supervised Learning to Biquality Learning: An Introduction". 2021 International Joint Conference on Neural Networks (IJCNN). pp. 1–10. arXiv:2012.09632. doi:10.1109/IJCNN52387.2021.9533353. ISBN 978-1-6654-3900-8. S2CID 237450775.
  5. ^ "Datasets Over Algorithms". Space Machine. Retrieved 2019-06-05.
  6. ^ a b c Roh, Yuji (8 Nov 2018). "A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective". arXiv:1811.03402 [cs.LG].
  7. ^ Dietterich, Thomas G.; Lathrop, Richard H.; Lozano-Pérez, Tomás (1 January 1997). "Solving the multiple instance problem with axis-parallel rectangles". Artificial Intelligence. 89 (1–2): 31–71. doi:10.1016/S0004-3702(96)00034-3.
  8. ^ Quadrianto, Novi; Smola, Alex J.; Caetano, Tibério S.; Le, Quoc V. (2009). "Estimating Labels from Label Proportions". Journal of Machine Learning Research. pp. 2349–2374.
  9. ^ a b Ré, Christopher; Selsam, Daniel; Wu, Sen; De Sa, Christopher; Ratner, Alexander (2016-05-25). "Data Programming: Creating Large Training Sets, Quickly". arXiv:1605.07723v3 [stat.ML].
  10. ^ Rooyen, Brendan van; Williamson, Robert C. (2018). "A Theory of Learning with Corrupted Labels". Journal of Machine Learning Research. pp. 1–50.
  11. ^ Hüllermeier, Eyke (2014). "Learning from imprecise and fuzzy observations: Data disambiguation through generalized loss minimization". International Journal of Approximate Reasoning. 55 (7): 1519–1534. arXiv:1305.0698. doi:10.1016/j.ijar.2013.09.003.
  12. ^ Cabannes, Vivien; Rudi, Alessandro; Bach, Francis (21 November 2020). "Structured Prediction with Partial Labelling through the Infimum Loss". International Conference on Machine Learning. PMLR. pp. 1230–1239.
  13. ^ Mann, Gideon S.; McCallum, Andrew (2010). "Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data". Journal of Machine Learning Research. pp. 955–984.
  14. ^ Jin, Ming; Jia, Ruoxi; Kang, Zhaoyi; Konstantakopoulos, Ioannis; Spanos, Costas (2014). "PresenceSense: zero-training algorithm for individual presence detection based on power monitoring". Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings: 1–10. doi:10.1145/2674061.2674073. S2CID 46950525.
  15. ^ Jin, Ming; Jia, Ruoxi; Spanos, Costas (2017). "Virtual occupancy sensing: using smart meters to indicate your presence". IEEE Transactions on Mobile Computing. 16 (11): 3264–3277. arXiv:1407.4395. doi:10.1109/TMC.2017.2684806. S2CID 1997078.
  16. ^ "What does smart meter know about you?". IEEE Spectrum.
  17. ^ Paul, Sujoy; Roy, Sourya; Roy-Chowdhury, Amit K. (2018). "W-TALC: Weakly-supervised Temporal Activity Localization and Classification". European Conference on Computer Vision (ECCV). arXiv:1807.10418.
  18. ^ Mithun, Niluthpol Chowdhury; Paul, Sujoy; Roy-Chowdhury, Amit K. (2019). "Weakly Supervised Video Moment Retrieval From Text Queries". Computer Vision and Pattern Recognition (CVPR). arXiv:1904.03282.
  19. ^ Paul, Sujoy; Tsai, Yi-Hsuan; Schulter, Samuel; Roy-Chowdhury, Amit K.; Chandraker, Manmohan (2020). "Domain Adaptive Semantic Segmentation Using Weak Labels". European Conference on Computer Vision (ECCV). arXiv:2007.15176.
  20. ^ "Snorkel and The Dawn of Weakly Supervised Machine Learning · Stanford DAWN". dawn.cs.stanford.edu. Retrieved 2019-06-05.
  21. ^ "Snorkel by HazyResearch". hazyresearch.github.io. Retrieved 2019-06-05.
  22. ^ "Snorkel AI scores $35M Series B to automate data labeling in machine learning". TechCrunch. Retrieved 2021-10-08.
  23. ^ Malkin, Rob; Ré, Christopher; Kuchhal, Rahul; Alborzi, Houman; Hancock, Braden; Ratner, Alexander; Sen, Souvik; Xia, Cassandra; Shao, Haidong (2018-12-02). "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale". Proceedings. ACM SIGMOD International Conference on Management of Data. 2019: 362–375. arXiv:1812.00417. Bibcode:2018arXiv181200417B. doi:10.1145/3299869.3314036. PMC 6879379. PMID 31777414.
  24. ^ "Announcing cleanlab: a Python Package for ML and Deep Learning on Datasets with Label Errors". l7.curtisnorthcutt.com. Retrieved 2020-02-04.
  25. ^ "An Introduction to Confident Learning: Finding and Learning with Label Errors in Datasets". l7.curtisnorthcutt.com. Retrieved 2020-02-04.
  26. ^ Northcutt, Curtis G.; Jiang, Lu; Chuang, Isaac L. (2019-10-31). "Confident Learning: Estimating Uncertainty in Dataset Labels". arXiv:1911.00068 [stat.ML].
  27. ^ Northcutt, Curtis. "CleanLab for Finding and Learning with Noisy Labels". GitHub. Retrieved 9 October 2019.
  28. ^ Druck, Gregory. "Active Learning by Labeling Features" (PDF). Retrieved 4 June 2019.
  29. ^ Zaidan, Omar. "Machine Learning with Annotator Rationales to Reduce Annotation Cost" (PDF). Retrieved 4 June 2019.
  30. ^ Nashaat, Mona; Ghosh, Aindrila; Miller, James; Quader, Shaikh; Marston, Chad; Puget, Jean-Francois (December 2018). "Hybridization of Active Learning and Data Programming for Labeling Large Industrial Datasets". 2018 IEEE International Conference on Big Data (Big Data). Seattle, WA, USA: IEEE: 46–55. doi:10.1109/BigData.2018.8622459. ISBN 9781538650356. S2CID 59233854.