
Machine learning skills for software engineers

Original article: Machine learning skills for software engineers

You may not be doing machine learning in order to become a data scientist, but you do need to pick up some data analysis skills. Here is where to start.


Ted Dunning is chief applications architect at MapR Technologies.

A long time ago, back in the mid-1950s, Robert Heinlein wrote a story called "A Door into Summer" in which a competent mechanical engineer used "Thorsen memory tubes" and some "integrated circuits" for judgment to kick off an entire industry of intelligent robots. To make the story more plausible, it was set in 1970, by which time such a robot could watch someone wash the dishes, copy the task perfectly, and thereby "learn" to wash dishes.

Perhaps I need to tell you that it did not turn out that way. The story may have seemed plausible in 1956, but by 1969 it was already clear that intelligent robots were not going to appear by 1970. A while later it became just as clear that they would not arrive by 1980 either, nor by 1990 or 2000. Every ten years, the ability of the ordinary engineer to build an artificially intelligent machine seemed to recede into the future at least as fast as time advanced. As technology progressed, the sheer difficulty of the problem became apparent as layer after layer of new difficulties was uncovered. (Translator's note: "A Door into Summer" is a science-fiction novel from around 1950 about intelligent robots and time travel.)


It is not that machine learning couldn't solve important problems; it could. By the mid-1990s, for example, essentially all credit card transactions were being scanned for fraud using neural networks, and by the late 1990s Google was analyzing advanced signals on the web to improve search. But ordinary software engineers had no chance of building such systems unless they went back to school for a PhD and found a crowd of like-minded friends who would do the same. Machine learning was hard, each new domain required breaking substantial new ground, and even the best researchers couldn't crack hard real-world problems such as image recognition.

I am happy to say that this has changed dramatically. While none of us is likely to found a Heinlein-style, automagical, all-robot engineering company in the near future, it is now possible for a software engineer with no special training to build things that do amazing stuff. The surprising part is not that computers can do these things (we have known since 1956 that this would be possible any day now); it is how far we have come in the past decade. A project that would have earned a good PhD ten years ago can now be finished in about a week.

Machine learning is getting easier (or at least more accessible)

In our forthcoming book "Machine Learning Logistics" (to be published by O'Reilly in late September 2017), Ellen Friedman and I describe a system called TensorChicken that our friend and fellow software engineer Ian Downard has built as a fun home project. The problem to be solved was that blue jays kept getting into our friend's chicken coop and pecking at the eggs. He wanted to build a computer vision system that could recognize a blue jay, so that some kind of action could be taken to stop the egg theft.

After seeing a deep learning demo by Google engineers on the TensorFlow team, Ian got hacking and built just such a system. He started from a partially pre-built model known as Inception-v3 and trained it on a few thousand new images taken by a webcam in his chicken coop. The result was deployed on a Raspberry Pi, although getting fast enough responses called for something a bit beefier, such as an Intel Core i7 processor.
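
To make the shape of that work concrete, here is a minimal transfer-learning sketch in the same spirit, written against TensorFlow's Keras API. It is not Ian's actual code: the photo directory, class names, and training settings below are stand-ins, and the real TensorChicken pipeline will differ in its details.

```python
# A minimal transfer-learning sketch (not Ian's actual code): take a
# pre-trained Inception-v3 network, freeze its convolutional layers, and
# train a small new classification head on your own photos. The directory
# name and class count below are placeholders.
import tensorflow as tf

IMG_SIZE = (299, 299)   # Inception-v3 expects 299x299 inputs
NUM_CLASSES = 3         # hypothetical: blue jay / hen / empty coop

# Load labeled webcam images from one sub-folder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "coop_photos/", image_size=IMG_SIZE, batch_size=32)

# Pre-trained feature extractor, without its original ImageNet classifier.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # only the new head gets trained

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```

Once trained, a model like this can be exported and run on a small device such as a Raspberry Pi, although, as Ian found, heavier hardware helps if response time matters.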

And Ian is not alone. There are all sorts of people out there, many of whom are not data scientists at all, building cool machines that do all kinds of things. More and more developers are taking on a wide variety of serious machine learning projects as they recognize that machine learning, and even deep learning, has become far more accessible. Developers are starting to take on roles as data engineers in a "data ops" style of work, in which data-focused skills (data engineering, architecture, data science) are combined with a devops style of software development to build machine learning systems.

It is really impressive how easy it can be to train a computer to spot blue jays with an image-recognition model. In many cases, ordinary folks can sit down and do this kind of thing, and much more. All that is needed is a few pointers to useful techniques and a bit of reframing of how you think about the problem, especially if you come at it mainly from a software development background.

Building models is different from building ordinary software because it is data-driven rather than design-driven. You have to look at the system from an empirical point of view, relying on experimental evidence that the model works rather than on a careful implementation of a good design backed by unit and integration tests. Also, some problems that used to be very hard in machine learning have become nearly easy, while others remain very hard no matter how you approach them and call for more sophisticated data science skills, including more mathematics. So the answer is to test: don't bet your farm (or your chicken coop) on a problem until you have confirmed that it falls into the easy category, and don't bet the farm even after it seems to work the first time. Be suspicious of good-looking results, as all good data scientists are, and test again and again.

Essential data skills for the machine learning beginner

The rest of this article covers some of the skills and tactics developers need in order to put machine learning to use.

Start with the data

In good software engineering, you can often reason out a design, write your software, and validate the correctness of your solution directly and independently. In some cases, you can even mathematically prove that your software is correct. The real world does intrude a bit, especially when humans are involved, but if you have good specifications, you can implement a correct solution.

With machine learning, you generally don’t have a tight specification. You have data that represents the past experience with a system, and you have to build a system that will work in the future. To tell if your system is really working, you have to measure performance in realistic situations. Switching to this data-driven, specification-poor style of development can be hard, but it is a critical step if you want to build systems with machine learning inside.

Learn to spot the better model

Comparing two numbers is easy. Assuming they are both valid values (not NaN’s), you check which is bigger, and you are done. When it comes to the accuracy of a machine learning model, however, it isn’t so simple. You have lots of outcomes for the models you are comparing, and there isn’t usually a clean-cut answer. Pretty much the most basic skill in building machine learning systems is the ability to look at the history of decisions that two models have made and determine which model is better for your situation. This judgment requires basic techniques to think about values that have an entire cloud of values rather than a single value. It also typically requires that you be able to visualize data well. Histograms and scatter plots and lots of related techniques will be required.
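
As a hypothetical illustration of looking at the whole cloud of outcomes, the sketch below compares two already-trained classifiers by plotting the distribution of their per-example losses on held-out data. The models and data (`model_a`, `model_b`, `X_test`, `y_test`) are assumed to exist and to follow a scikit-learn-style `predict_proba` interface.

```python
# Hypothetical sketch: compare two already-trained classifiers by looking at
# the whole distribution of their held-out errors, not a single number.
# `model_a`, `model_b`, `X_test`, and `y_test` are assumed to already exist.
import numpy as np
import matplotlib.pyplot as plt

def per_example_log_loss(model, X, y):
    """Negative log-probability the model assigns to the true class."""
    proba = model.predict_proba(X)                # shape (n_examples, n_classes)
    p_true = proba[np.arange(len(y)), y]          # probability of the correct label
    return -np.log(np.clip(p_true, 1e-12, 1.0))

loss_a = per_example_log_loss(model_a, X_test, y_test)
loss_b = per_example_log_loss(model_b, X_test, y_test)

# Overlaid histograms show the "cloud" of outcomes for each model.
plt.hist(loss_a, bins=50, alpha=0.5, label="model A")
plt.hist(loss_b, bins=50, alpha=0.5, label="model B")
plt.xlabel("per-example log loss (lower is better)")
plt.ylabel("number of test examples")
plt.legend()
plt.show()

# A scatter plot of paired losses shows where the two models disagree.
plt.scatter(loss_a, loss_b, s=5, alpha=0.3)
plt.xlabel("model A loss")
plt.ylabel("model B loss")
plt.show()
```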

Be suspicious of your conclusions

Along with the ability to determine which variant of a system is doing a better job, it is really important to be suspicious of your conclusions. Are your results a statistical fluke that will go the other way with more data? Has the world changed since your evaluation, thus changing which system is better? Building a system with machine learning inside means that you have to keep an eye on the system to make sure that it is still doing what you thought it was doing to start with. This suspicious nature is required when dealing with fuzzy comparisons in a changing world.
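
One simple way to put a number on that suspicion, sketched below under the same assumptions as the previous example, is to bootstrap the evaluation set and see how often the apparent winner flips; if it flips often, the difference you measured may well be a fluke.

```python
# Hypothetical sketch: bootstrap the evaluation set to see how often the
# apparent winner would flip on a different sample of test data. Reuses the
# per-example loss arrays `loss_a` and `loss_b` from the previous sketch.
import numpy as np

rng = np.random.default_rng(42)
n = len(loss_a)
n_resamples = 10_000
flips = 0

observed_diff = loss_a.mean() - loss_b.mean()   # positive means model B looks better

for _ in range(n_resamples):
    idx = rng.integers(0, n, size=n)            # resample test examples with replacement
    diff = loss_a[idx].mean() - loss_b[idx].mean()
    if np.sign(diff) != np.sign(observed_diff):
        flips += 1

print(f"observed mean loss difference: {observed_diff:.4f}")
print(f"winner flipped in {flips / n_resamples:.1%} of bootstrap resamples")
```

Even a difference that survives this kind of check can evaporate as the world changes, so keep re-running the comparison on fresh data.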

Build many models to throw away

It is a well-worn maxim in software development that you will need to build one version of your system just to throw away. The idea is that until you actually build a working system, you won’t really understand the problem well enough to build that system well. So you build one version in order to learn and then use that learning to design and build the real system.

With machine learning, the situation is the same, but more so. Instead of building just one disposable system, you must be prepared to build dozens or hundreds of variants. Some of these variants might use different learning technologies or even just different settings for the learning engine. Other variants might be completely different restatements of the problem or the data that you use to train the models. For instance, you might determine that there is a surrogate signal that you could use to train the models even if that signal isn’t really what you want to predict. That might give you 10 times more data to train with. Or you might be able to restate the problem in a way that makes it simpler to solve.

The world may well change. This is particularly true, for instance, when you are building models to try to catch fraud. Even after you build a successful system, you will need to change in the future. The fraudsters will spot your countermeasures, and they will change their behavior. You will have to respond with new countermeasures.

So for successful machine learning, plan to build a bunch of models to throw away. Don’t expect to find a golden model that is the answer forever.

Don’t be afraid to change the game

The first question that you try to solve with machine learning is usually not quite the right one. Often it is dramatically the wrong one. The result of asking the wrong question can be a model that is nearly impossible to train, or training data that is impossible to collect. Or it may be a situation where a model that finds the best answer still has little value.

Recasting the problem can sometimes give you a situation where a model that is very simple to build delivers very high value. I had a problem once that was supposed to be about recommending sale items. It was really hard to get even trivial gains, even with some pretty heavy techniques. As it turned out, the high-value problem was to determine when good items went on sale. Once you knew when, the problem of which products to recommend became trivial because there were many good products to recommend. At the wrong times, there was nothing worth recommending anyway. Changing the question made the problem vastly easier.

Start small

It is extremely valuable to be able to deploy your original system to just a few cases or to just a single sub-problem. This allows you to focus your effort and gain expertise in your problem domain and gain support in your company as you build models.

Start big

Make sure that you get enough training data. In fact, if you can, make sure that you get 10 times more than you think you need.

Domain knowledge still matters

In machine learning, figuring out how a model can make a decision or a prediction is one thing. Figuring out what really are the important questions is much more important. As such, if you already have a lot of domain knowledge, you are much more likely to ask the appropriate questions and to be able to incorporate machine learning into a viable product. Domain knowledge is critical to figuring out where a sense of judgment needs to be added and where it might plausibly be added.

Coding skills still matter

There are a number of tools out there that purport to let you build machine learning models using nothing but drag-and-drop tooling. The fact is, most of the work in building a machine learning system has nothing to do with machine learning or models and has everything to do with gathering training data and building a system to use the output of the models. This makes good coding skills extremely valuable. There is a different flavor to code that is written to manipulate data, but that isn’t hard to pick up. So the basic skills of a developer turn out to be useful skills in many varieties of machine learning.
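
To give a feel for that different flavor, here is a small, hypothetical sketch of the kind of data-manipulation code that tends to dominate these projects: loading raw records, cleaning them, and holding out data for evaluation. The file name and column names are made up for illustration.

```python
# Hypothetical sketch of the data-wrangling code that dominates most machine
# learning projects; the file name and column names are made up.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("transactions.csv")           # gather the raw training data

# Cleaning: drop records with missing labels, clip obvious outliers, and turn
# a timestamp into a feature the model can actually use.
df = raw.dropna(subset=["label"])
df = df[df["amount"].between(0, df["amount"].quantile(0.999))]
df["hour"] = pd.to_datetime(df["timestamp"]).dt.hour

features = df[["amount", "hour"]]
labels = df["label"]

# Hold out data the model never sees, so that the evaluation means something.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

print(f"{len(X_train)} training rows, {len(X_test)} held-out rows")
```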

Many tools and new techniques are becoming available that allow practically any software engineer to build systems that use machine learning to do some amazing things. Basic software engineering skills are highly valuable in building these systems, but you need to augment them with a bit of data focus. The best way to pick up these new skills is to start today in building something fun.

---

Ted Dunning is chief applications architect at MapR Technologies and a board member for the Apache Software Foundation. He is a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects and a mentor for various incubator projects. He was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud detection systems for ID Analytics (LifeLock). He has a Ph.D. in computing science from University of Sheffield and 24 issued patents to date. He has co-authored a number of books on big data topics including several published by O’Reilly related to machine learning. Find him on Twitter as @ted_dunning.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.