Highlights from ODSC Immersive AI Event in NYC
Attending conferences, events, meet-ups, or just socializing with people who share your interests (in my case, DATA) is a great way to exchange ideas and learn new concepts. Below is a quick snapshot of some of the sessions I took part in at the Open Data Science Conference (ODSC) Immersive AI event in NYC (June 28-29, 2019).
Introduction to Machine Learning with Scikit-learn – Andreas C. Müller
The goal of machine learning is to make predictions about new entities from past data, so it is important to collect the right data.
Common Types of Machine Learning:
- Supervised – most commonly used (e.g. spam detection, medical diagnosis, ad-click prediction)
- Unsupervised – algorithm used to draw inferences from datasets consisting of input data without labeled responses
- Reinforcement – you don’t work on a data set you’ve collected; instead, an agent interacts with an environment (e.g. AlphaGo)
Other types of Machine Learning
- Semi-Supervised – mix of supervised and unsupervised
- Active Learning
Classification & Regression
- Classification – target y is discrete (will you pass an exam? yes or no)
- You want an exact prediction; you don’t want to make a mistake
- Regression – target y is continuous (how many points will you get on the exam?)
- An approximately correct answer is acceptable (you can be off by a few points on an exam); you can be “close” to the answer
Relationship to Statistics
- Statistics – emphasis on the model first (inference); making statements about the world: you have a population and you want to make a statement about that population
- Machine Learning – emphasis on the data first (prediction); it’s more specific to an individual in the population. The interest is in making accurate predictions rather than an accurate model (it doesn’t matter whether the model reflects the real-world population, as long as it predicts accurately)
- The ultimate goal is not an accurate model; it’s to make more money or improve lives. The measure of how well we are doing isn’t MSE or predictive accuracy but the metric we actually care about (e.g. customer satisfaction) and the impact of our work on that metric
- Cost of complex systems – people want to build fancy things (like deep learning). Start by thinking about your goals and which metrics can track progress, then collect data to see how well you’re doing. Don’t start with machine learning; start with simple heuristics. If you do use machine learning, start with simple models first (such as regression) and compare them to a baseline that predicts just one class. Keep the system as simple as possible for as long as possible, and when you do introduce complexity, make sure it really helps.
- Ethical considerations – ethics need to be considered when working with data. Think about potential bias: algorithms are not unbiased; they are driven by the data you provide them.
Training and Test Data
To measure how much we can trust the model, use a training set to train and a test set to evaluate how well the model will do on unseen data.
Data loading time!
We will classify images of digits; we’ll start by checking how many items there are in each class.
Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features).
Split the data and get going –
Define the training and test data sets
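As a sketch of that split, assuming scikit-learn and its bundled digits dataset (the session's exact dataset may differ):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# X is a numpy array of shape (n_samples, n_features); y holds the labels
X, y = load_digits(return_X_y=True)

# Hold out part of the data as a pretend "future" test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```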
Supervised ML Workflow
- The fit method is called on the training data (e.g. with a random forest classifier); then the predict method produces labels you compare with the test labels. You can also let the model predict internally and compute accuracy with the score method
- Nearest-neighbor classification is a good way to explain ML algorithms: two features (x and y axes), blue dots and red dots; for a new dot, will it be blue or red?
- Find the nearest point in the data set and use its class; a very simple model
- KNN with scikit-learn
- KNN.fit builds the model – for KNN it just remembers the data; very simple
- We can also look at the 3 nearest neighbors and let them vote on which class the new point is assigned; different values of n_neighbors give different predictions
- Increasing n_neighbors increases the number of classification mistakes on the training data (the model becomes smoother and less flexible)
- Tuning parameters control how ragged or complicated the model is – how much you want to fit the data set (overfitting and underfitting)
- Boxplot – to inspect the scale of the data and see the order of magnitude of each feature
- It’s important to scale, because it determines what to expect when applying k-nearest neighbors
- If you don’t scale the data and the distances along the y axis are tiny compared to the x axis, the y axis is effectively ignored
- Common method – Z-scaling – each feature is scaled individually (subtract the mean and divide by the standard deviation)
- Min-max scaling – scale between a min and max value (like 0 and 1); this makes sense for features that have clear boundaries (it doesn’t work well for Gaussian-distributed features)
- Normalizer – projects each sample; if you have counts of different things (how often a set of actions occurs), it divides each row by its norm (e.g. the sum of the counts) to turn it into a histogram of relative frequencies
- Preprocessing should be done on the training data set only; the test data set is our pretend future dataset, so we should treat it the same as future data
- It’s important to apply the exact scaling transformation you fit on the training set to the test set
- Ordinal encoding – convert a string variable to a categorical variable and use integer encoding – assign numbers to the strings (in this example, boroughs)
- data.head() provides the first 5 rows of data
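The fit/score workflow, k-nearest neighbors, and train-set-only scaling described above can be sketched like this (a minimal example using scikit-learn's bundled breast-cancer data, not the session's exact code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Z-scaling: fit on the training set only, then apply the SAME
# transformation to the test set (our pretend future data)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# n_neighbors is the tuning parameter that controls model complexity
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_s, y_train)         # fit builds the model (KNN just stores the data)
print(knn.score(X_test_s, y_test))  # score predicts internally and computes accuracy
```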
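And a small sketch of the ordinal-encoding step with pandas; `boro` and the sample values here are hypothetical stand-ins for the session's borough column:

```python
import pandas as pd

data = pd.DataFrame({"boro": ["Manhattan", "Queens", "Brooklyn", "Queens"]})
print(data.head())  # head() shows the first 5 rows

# Convert the string column to a categorical dtype, then use the
# integer category codes as an ordinal encoding
data["boro"] = data["boro"].astype("category")
data["boro_code"] = data["boro"].cat.codes
print(data)
```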
Building Data Science Teams – Drew Conway
Transparency is valuable at every step of your interview process.
Your interview is a product of your team’s culture.
A heavily asymmetrical interview process can hurt you – tell candidates more about your company/product.
- Pre-interview
  - Write a specific job requirement
    - If you are hiring, you must need help – what do you need help with? Don’t be cryptic; lots of transparency
  - Build the take-home exercise
    - The exercise should directly reflect your work
    - The more effort you put into building it, the more value you will get out
  - Phone screen
    - Answer the candidate’s questions about us
    - Will this candidate successfully complete the take-home?
  - Send the take-home
    - Respect their availability
    - Provide guardrails: don’t spend more than x hours on this
    - Use the tools that you actually use (don’t say they can use any tool if there’s a specific tool you need)
  - Review the take-home
    - Did the code meet the technical requirements?
    - Do we want to learn more about this person?
- Meeting 1 – code review
  - Pair with dev and/or data science
  - Live code review; question the approach; understand the logic
- Meeting 2 – project planning
  - Create new requirements from the take-home: can the candidate articulate a project we might do around the little project they built in the take-home?
  - How does the candidate take ownership of the process of solving the problem and the creative problems?
- Meeting 3 – project discovery
  - Technical deep dive on the proposed project: what tools would you use; how long will it take, what will it cost, and what resources are needed to build this?
  - Can take the form of a pre-mortem, or similar
- Meeting 4 – company values / culture fit
  - Rank our values from most to least important
  - There are no right answers; we just want to learn and discuss
- Post-interview
  - Consensus no-hire – inform the candidate; failed interviews are costly
  - Consensus hire – begin putting together an offer
  - Need more information – occasionally the team wants a little more info; schedule a 60-minute call to review a specific topic directly from the meetings
Programming with Data: Foundations of Python and Pandas –
This session focused on working with tabular data using Python and pandas.
Pandas is a large library; we’ll focus on some of the more useful parts of pandas
What is a series?
- For univariate datasets; the Series is what everything else in pandas is built on
- Ordered key-value pairs with a homogeneous data type
- A data array plus a label (index) array
A simple series
- `import numpy as np` and `import pandas as pd`
- NumPy is a lower-level matrix-manipulation library
- First series – use a list for our first series
- `dtype` is the series’ datatype; convert it with `s.astype(np.float64)`
Series from a pandas perspective
A mapping from an index to values
Dictionaries – Python dicts preserve insertion order
Operations in pandas are implicitly aligned by index
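A minimal sketch of those ideas – a Series as labeled values with a homogeneous dtype, `astype`, and implicit index alignment:

```python
import numpy as np
import pandas as pd

# A Series is a data array plus a label (index) array
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# astype returns a new Series with a homogeneous float dtype
f = s.astype(np.float64)

# Operations are implicitly aligned by index, not by position;
# labels missing on either side become NaN
t = pd.Series({"c": 10, "a": 30})   # dicts preserve insertion order
print((s + t).to_dict())
```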
Visualizations with Python – Kimberly Fessel, Metis
- Matplotlib – the introductory library; highly customizable, rubbish defaults, lengthy code to customize; it’s designed to act like MATLAB
- Seaborn – updated default style; matplotlib code is still valid; advanced visuals (heatmaps, box plots, contour plots); less code for quality visuals
- Plotly – interactive visuals with tooltips (like Tableau); nested dictionaries; you can’t really export the interactive visual; you can pay to use Dash, but there’s no free workaround right now
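As a small illustration of that comparison, here is a seaborn heatmap layered on top of standard matplotlib (assuming both libraries are installed; rendered off-screen so no display is needed):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so this runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
data = rng.random((5, 5))

# seaborn adds an updated style and high-level chart types, while
# all the usual matplotlib calls (titles, figures) remain valid
fig, ax = plt.subplots()
sns.heatmap(data, annot=True, ax=ax)
ax.set_title("Explicit titles tell people what they need to know")
fig.savefig("heatmap.png")
```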
Data Visuals for Communication
- Help practitioners better understand complex patterns
- Enhance collaboration among colleagues
- Facilitate data-driven decisions by stakeholders and specialists
Google Colab/Jupyter Notebooks
- Executes Python code on the fly
- Interactivity allows for instant feedback
- Memory persists across cells
- Use markdown (TEXT) mode for adding text like this
Explicit titles tell people what they need to know
ML: A Framework for Understanding the AI Landscape & Terminology – David Yakobovitch, Galvanize
Check out his podcast: http://www.humainpodcast.com/about/
Cool article from the Future Today Institute: https://futuretodayinstitute.wetransfer.com/downloads/ff42727439bd928007628c18d160958a20190309150253/fb38f1
By 2025, 20 million jobs will be automated (robotics).
Data & AI Landscape 2019: https://mattturck.com/data2019/
Data Science Workflow
- Identify the problem
- Acquire the data
- Refine the data
- Build data models
- Communicate your results
GitHub – the site the coding community uses to store code, textbooks, and open-source projects, including Jupyter notebooks.
APIs – data gateways; see ProgrammableWeb, or Google for APIs (type in “weather” or something similar).
Thank you to ODSC for putting together this great event; it was a really good use of my time. I learned a lot, interacted with some data friends, and met new people who love DATA!