[P5-DS] My Data Science Path 2019: September, 4th week
Subject: How to analyze a dataset from Kaggle & how about the IBM DS course
Hello everyone, welcome back to my data science journey. It's my 5th week, which means my 5th post: how I moved from learning to practicing. I think what I have learned so far is enough to take a dataset all the way to model development. If you have read my older posts, you know what I have covered.
1) Let's talk about where I faced difficulty
As we all know, real-world data is really messy, so I planned to visit Kaggle, pick a dataset and do some analysis. I went to kaggle.com and searched for India datasets, which showed a few results. I picked the Startup India dataset, downloaded it and loaded it, and that is where the real task began.
All of the columns were of object type. When I tried to convert df["Date"] to datetime it did not convert, and df["Amount"] would not convert to float either. A voice in my head said "let's do this", so I went through each and every column. The data was not cleaned; it was still messy.
So I looked for the values that were stopping df["Date"] from converting:
for i in df["Date"].unique():
    # Flag values that are too long, or that lack a "/" right before a 4-digit year
    if len(i) > 10:
        print(i)
    elif i[-5] != "/":
        print(i)

Output:
05/072018
01/07/015
\\xc2\\xa010/7/2015
12/05.2015
13/04.2015
15/01.2015
22/01//2015
So these are the dates that caused me problems, and I fixed them.
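The post does not show how those dates were fixed, so the following is only a minimal sketch of one possible repair, not the original code. The helper name `clean_date` is hypothetical, and it assumes every date is meant to be day/month/year. It strips the escape debris, treats "." as a mistyped separator, collapses doubled slashes and re-inserts a missing slash before the year. A value whose year cannot be recovered safely (such as "01/07/015") is returned as None so it can be checked by hand.

```python
import re
from datetime import datetime

def clean_date(raw):
    """Return a datetime for a messy DD/MM/YYYY string, or None if it cannot be repaired safely."""
    if not isinstance(raw, str):
        return None  # e.g. NaN from a missing cell
    # Remove literal escape debris such as "\\xc2\\xa0" and real non-breaking spaces
    s = re.sub(r"\\+x[0-9a-fA-F]{2}", "", raw).replace("\xa0", "").strip()
    # Treat "." as a separator typo and collapse doubled slashes
    s = re.sub(r"/+", "/", s.replace(".", "/"))
    # Re-insert a missing slash between month and year: "05/072018" -> "05/07/2018"
    s = re.sub(r"^(\d{1,2})/(\d{2})(\d{4})$", r"\1/\2/\3", s)
    try:
        return datetime.strptime(s, "%d/%m/%Y")
    except ValueError:
        return None  # e.g. "01/07/015": the year is ambiguous, leave it for manual review

# The problem values printed by the loop above
samples = ["05/072018", "01/07/015", r"\\xc2\\xa010/7/2015",
           "12/05.2015", "13/04.2015", "15/01.2015", "22/01//2015"]
for value in samples:
    print(repr(value), "->", clean_date(value))
```

On the real column, something like `pd.to_datetime(df["Date"].map(clean_date))` should then give a datetime column, with the unrepaired values showing up as NaT.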
Then I moved to df["Amount"]:
I found lots of stray characters and symbols in the Amount column, so I wrote code that keeps only the digits [0-9] and the decimal point [.] with the help of the re library.
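import re
import numpy as np
import pandas as pd

def modifiy_int(amt):
    # Missing values pass through unchanged
    if pd.isna(amt):
        return np.nan
    # Keep only digits and dots; a "|" inside [] would also match a literal pipe
    x = re.findall(r"[\d.]", amt)
    if len(x) > 0:
        return "".join(x)
    else:
        return np.nan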
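The helper above returns a string, so the column still needs a numeric conversion afterwards. As a rough sketch of doing both steps at once (the name `amount_to_float` and the sample values are made up for illustration, not taken from the dataset):

```python
import math
import re

def amount_to_float(raw):
    """Keep only digits and dots from a messy amount and return a float (NaN if nothing usable is left)."""
    if isinstance(raw, (int, float)):
        return float(raw)  # already numeric; NaN stays NaN
    if not isinstance(raw, str):
        return math.nan
    digits = "".join(re.findall(r"[\d.]", raw))
    try:
        return float(digits)
    except ValueError:
        return math.nan  # empty result, or something like "1.2.3"

# Hypothetical examples of messy values
for value in ["1,30,00,000", "USD 2500.50", "undisclosed", float("nan")]:
    print(repr(value), "->", amount_to_float(value))
```

One thing to watch with this approach: every dot is kept, so a prefix such as "Rs." would leave a leading dot and turn the number into a fraction. It is worth checking the largest and smallest results after converting.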
This solved my problems. Before moving on to visualization, for which the dataset source suggests some good questions, I finished the data cleaning by filling all the NaN values with the mean and the mode.
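For anyone who wants to see what that filling step could look like, here is a small sketch. It assumes the mean was used for numeric columns and the mode for text columns, which the post does not spell out. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names and made-up values
df = pd.DataFrame({
    "Amount": [100.0, np.nan, 300.0, 200.0],
    "City": ["Bangalore", "Mumbai", None, "Bangalore"],
})

# Numeric column: fill the gaps with the mean of the known values
df["Amount"] = df["Amount"].fillna(df["Amount"].mean())

# Text column: fill the gaps with the most frequent value
df["City"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```

Funding amounts tend to be very skewed, so the median may be a safer fill value than the mean. That is a suggestion, not something this post tested.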
Possible questions which could be answered are:
How does the funding ecosystem change with time?
Do cities play a major role in funding?
Which industries are favored by investors for funding?
Who are the important investors in the Indian Ecosystem?
How much funds does startups generally get in India?
So I plotted a few graphs based on those questions.
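The plots themselves are not shown here, but the numbers behind two of those questions (funding over time and funding by city) can be sketched with a groupby. The column names "Date", "City" and "Amount" are assumptions, and the rows below are made up purely to make the example runnable.

```python
import pandas as pd

# Made-up rows; the column names are assumed, not taken from the real dataset
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-15", "2015-07-10", "2016-03-02", "2016-11-20"]),
    "City": ["Bangalore", "Mumbai", "Bangalore", "Delhi"],
    "Amount": [1.0e6, 2.5e6, 4.0e6, 0.5e6],
})

# How does funding change with time? Total amount per year
by_year = df.groupby(df["Date"].dt.year)["Amount"].sum()

# Do cities play a major role? Total amount per city, largest first
by_city = df.groupby("City")["Amount"].sum().sort_values(ascending=False)

print(by_year)
print(by_city)
```

With matplotlib installed, calling `by_year.plot(kind="bar")` on either result is one quick way to turn it into a graph.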
I plan to take on more datasets for practice next week, until the end of September.
2) So what about IBM data science?
I'm not a judge who can give a verdict, so I'll just share my opinion. The course I took is Data Analysis with pandas, part of the IBM DS module. This week, trust me, I completed the first 4 weeks of the course in half a day. The 5th week is model creation, and I haven't watched those videos because I have planned my own path for learning machine learning: I already know how the algorithms work theoretically (from 5-6 months ago), so now I need to learn the math behind them. I think they haven't covered all the pandas functions or all of EDA, but they covered the most popular techniques.
3) Resources of the Week:
Kaggle: Trust me, we need to do lots of practice, terabytes of practice. We can learn a lot while practicing. No matter what stage you are at now, try to analyze a dataset with what you have learned.