Application Exploration of Machine Learning in Natural Language Processing and Computer Vision

Peiheng Qin

doi:10.71222/tpec0416

Authors

Peiheng Qin, School of Computer Science, Faculty of Engineering, The University of Sydney, Sydney, Australia

DOI:

https://doi.org/10.71222/tpec0416

Keywords:

machine learning, natural language processing, computer vision, deep learning, transformer, convolutional neural network, multimodal learning

Abstract

Machine learning, and deep learning in particular, has revolutionised both natural language processing (NLP) and computer vision (CV) over the last decade. Performance on benchmark tasks has repeatedly surpassed the previous state of the art and, in many cases, approaches or exceeds human-level accuracy. This paper provides a structured exploration of the main applications of machine learning in these two domains. It examines the architectural innovations, from convolutional neural networks to the Transformer, that have propelled progress, and analyses representative applications including text classification, machine translation, large language models, image classification, object detection, and generative visual modelling. The paper also reviews the convergence of NLP and CV through multimodal architectures such as CLIP and BLIP-2, which have enabled cross-modal reasoning and opened new application domains. It identifies computational cost, data bias, and limited interpretability as the main obstacles, and discusses future research directions focused on efficient models and multimodal reasoning. We aim to give a thorough comparative analysis of ML applications in the two domains and to identify shared architectural trends that point to a common future for research on intelligent systems.

References

1. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT 2019, Association for Computational Linguistics, Minneapolis, 2019, pp. 4171–4186.

2. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, vol. 35, NeurIPS, New Orleans, 2022, pp. 27730–27744.

3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, NeurIPS, Long Beach, 2017, pp. 5998–6008.

4. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

5. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25, NeurIPS, Lake Tahoe, 2012, pp. 1097–1105.

6. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of IEEE CVPR 2016, IEEE, Las Vegas, 2016, pp. 770–778.

7. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, NeurIPS, 2020, pp. 1877–1901.

8. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint, arXiv:2001.08361, 2020.

9. OpenAI, "GPT-4 technical report," arXiv preprint, arXiv:2303.08774, 2023.

10. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in Proceedings of ICLR 2021, OpenReview, 2021.

11. K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961–2969.

12. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, vol. 27, NeurIPS, Montreal, 2014, pp. 2672–2680.

13. J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Advances in Neural Information Processing Systems, vol. 33, NeurIPS, 2020, pp. 6840–6851.

14. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proceedings of ICML 2021, vol. 139, PMLR, 2021, pp. 8748–8763.

15. J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in Proceedings of ICML 2023, vol. 202, PMLR, 2023, pp. 19730–19742.
