Popular data science meetup groups in Warsaw have over 1000 members, and their events and afterparties are a great opportunity to meet new, interesting people working with data in various industries. That also means we already have a significant source of valuable information full of undiscovered insights. I was attending one of the data science meetups when I started to wonder… How many of the attendees in this room actually work as data scientists, and what are the backgrounds of everyone else? I decided to find an answer using R and Python. I chose the largest Warsaw meetups and gathered their attendee data for further analysis (ordered alphabetically): <a href="http://www.meetup.com/Big-Data-Warsaw/">Big Data Warsaw</a>, <a href="http://www.meetup.com/Data-Science-Warsaw/">Data Science Warsaw</a>, <a href="http://www.meetup.com/Machine-Learning-Warsaw/">Machine Learning Warsaw</a>, <a href="http://www.meetup.com/PyData-Warsaw/">PyData</a>, <a href="http://www.meetup.com/QlikWAW/">Qlik</a>, <a href="http://www.meetup.com/Spotkania-Entuzjastow-R-Warsaw-R-Users-Group-Meetup/">Warsaw R Enthusiasts</a> and <a href="http://www.meetup.com/warsaw-hug/" target="_blank" rel="noopener noreferrer">Warsaw Hadoop User Group</a>. You can find my results in a Shiny dashboard <a href="https://pawelp.shinyapps.io/meetup-analysis/" target="_blank" rel="noopener noreferrer">published here</a>.
<h2 id="assumptions">Assumptions</h2>
After getting familiar with the gathered meetup data, I decided to classify attendees by their public professional profiles on social media. I created the following classes of attendees:
<ul><li><em>Dev</em> - software developers (Java/iOS/JS/.NET/Scala/…), tech leads, web developers, QA, etc.</li><li><em>Business</em> - HR, managers, business consultants, owners, founders, PR, marketing, etc.</li><li><em>Data Scientist</em> - people whose job description mentions areas like data science, data analysis, big data, machine learning and similar.
In particular, this group includes software developers who deal with data science/machine learning on a daily basis.</li><li><em>Student</em></li><li><em>Academic</em> - people working at universities, PhD students, professors, etc.</li><li><em>Other</em> - unable to classify into any other group / spam / no public profile data.</li></ul>
It is common for someone to belong to multiple groups, like a student working as a data science developer, but I had to assign each person to exactly one category.
<h2 id="how-many-data-scientists-are-there">How many Data Scientists are there?</h2>
The results are interesting, but not surprising: <!--html_preserve--> <iframe src="https://plot.ly/~przytu1/19.embed" width="100%" height="500" frameborder="0" scrolling="no"></iframe> <!--/html_preserve-->
There is a big disparity in attendee proportions between particular meetups. For example, Warsaw Hadoop User Group targets mainly developers; let’s see what these proportions look like for each meetup: <!--html_preserve--> <center> <iframe src="https://plot.ly/~przytu1/25.embed" width="100%" height="1500" frameborder="0" scrolling="no"></iframe></center> <!--/html_preserve-->
Basic observations:
<ul><li>The meetups with the largest proportion of developers are Warsaw Hadoop User Group, Machine Learning Warsaw and PyData Warsaw. There, almost 1 in 2 attendees is a developer.</li><li>1 in 5 attendees of the Warsaw R Enthusiasts meetup works professionally as a data scientist.</li><li>The meetup with the largest proportion of business people is Qlik.
Unfortunately, the group of classified people is relatively small, but such a result is intuitively expected.</li></ul>
<h2 id="gathering-data">Gathering data</h2>
Now let’s see how many attendees were classified for each selected meetup: <!--html_preserve--> <iframe src="https://plot.ly/~przytu1/15.embed" width="100%" height="400" frameborder="0" scrolling="no"></iframe> <!--/html_preserve-->
Here’s what the steps of the gathering algorithm looked like:
<ol><li>Collect raw data from the meetup website using a Chrome plugin - <a href="https://data-miner.io/">DataMiner</a>. This tool contains ready-made <a href="https://en.wikipedia.org/wiki/XPath" target="_blank" rel="noopener noreferrer">XPath</a> sets for popular websites, so it was easy to get the raw data, even though it was available only to logged-in users.</li><li>Drop names that looked like nicknames or were anonymized (like “John D.”). A simple rule was effective enough: <code class="highlighter-rouge">number of name parts >= 2 (at least name and surname) && number of letters in each name part >= 3</code></li><li>Scrape data from public professional profiles. This was the most technical and challenging part. I used the popular library <a href="https://github.com/scrapinghub/scrapy-splash" target="_blank" rel="noopener noreferrer">Scrapy together with Splash</a>, a JavaScript rendering engine. Some of the websites intentionally hide their content by returning scripts instead of a fully rendered DOM structure.</li><li>Classify attendees based on keywords found in the gathered job titles and descriptions. This process was supported by manual selection of common, useful keywords. My intention was to answer my initial question, not to create a state-of-the-art classifier.</li></ol>
I assumed that each collected name is unique and represents the same person across all meetups. This doesn’t have to be true. In total, I managed to gather 1873 unique names from the meetup groups.
70% of them were not anonymous, and 60% of them could be assigned to one of the classes. <!--html_preserve--> <iframe src="https://plot.ly/~przytu1/17.embed" width="100%" height="400" frameborder="0" scrolling="no"></iframe> <!--/html_preserve-->
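Steps 2 and 4 of the gathering algorithm described above can be sketched in Python roughly as follows. This is a minimal illustration, not the code used for the analysis: the function names, the keyword lists and the priority order between classes are all assumptions made for the example. Only the name rule (at least two name parts, each with at least three letters) and the class names come from the text.

```python
import re


def looks_like_full_name(name: str) -> bool:
    """Name rule from step 2: >= 2 name parts, each with >= 3 letters."""
    parts = name.split()
    return len(parts) >= 2 and all(
        sum(ch.isalpha() for ch in part) >= 3 for part in parts
    )


# Hypothetical keyword lists; the keywords actually used are not published.
# Order is an assumption too: the first class with a matching keyword wins,
# so a developer doing machine learning lands in "Data Scientist" and a
# PhD student lands in "Academic", as the class descriptions suggest.
CLASS_KEYWORDS = [
    ("Data Scientist", ["data scien", "data analy", "big data", "machine learning"]),
    ("Academic", ["phd", "professor", "university", "researcher"]),
    ("Student", ["student"]),
    ("Dev", ["developer", "software engineer", "tech lead", "qa"]),
    ("Business", ["manager", "consultant", "founder", "owner", "marketing", "hr"]),
]


def classify(profile_text: str) -> str:
    """Assign exactly one class based on keywords in a job title/description."""
    text = profile_text.lower()
    for label, keywords in CLASS_KEYWORDS:
        # \b anchors the keyword at the start of a word, so "hr" does not
        # match inside "three", while "data scien" still matches "data scientist".
        if any(re.search(r"\b" + re.escape(kw), text) for kw in keywords):
            return label
    # No keyword matched, or there was no public profile text at all.
    return "Other"
```

With these assumed keywords, `looks_like_full_name("John D.")` rejects the anonymized name, and `classify("Machine Learning Developer")` picks Data Scientist rather than Dev because that class is checked first. A fixed priority order is one simple way to force a single class per person; the original analysis may have resolved overlaps differently.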