<h2 id="fb-bs">Class Demo: Scraping your Facebook posts with BeautifulSoup</h2>
<p>Albert Yumol · 2020-05-20</p>
<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>It has been a while since my last post. The corona crisis really altered my schedule and plans for 2020, but we need to adapt, move forward, and proceed to our new normal. I am lucky that right now I am able to write again and breathe freely.</p>
<p>To the reader, sending you virtual hugs. I know that we are in rough times, but this too shall pass.</p>
<p>This post is a demonstration I used in my Web Scraping class for Eskwelabs. The goal is to scrape your own Facebook posts and make a <code class="language-plaintext highlighter-rouge">Word Cloud</code> from them.</p>
<p>Data Science is not only about fancy algorithms, statistics, and math. It is also about credibly and creatively sourcing your data. Given that the bulk of data on the internet hides in the backend or is scattered across the frontend of websites, ethical web scraping (also called data mining or web crawling) is indeed a handy technique.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal">Cambridge Analytica Scandal</a> took its toll and changed the data mining landscape, putting the focus on data ownership and privacy. The personal data of millions of Facebook users were scraped and used without consent for political advertising and campaigns.</p>
<p>It was a good thing that in 2018 the European Union pushed through with implementing the General Data Protection Regulation (GDPR), which put people’s privacy over their digital data at a premium.</p>
<p>Because of this, many tech companies now give their users the option to download their personal data (although it is unclear whether this data is complete). At the very least we have a sense of ownership of, and visibility into, our own data. If Facebook ever sells my info, at least I know what info they got from me, and I can think about how they might use it. Let’s start scraping!</p>
<h3 id="step-1-download-facebook-data">Step 1: Download Facebook Data</h3>
<p>Log in to your Facebook account, go to Settings, and click Download Your Information.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_1.png" alt="download_info" class="center" /></p>
<p>For the purposes of this tutorial/demo, we are only interested in your posts. So untick all check boxes except for posts.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_2.png" alt="posts" class="center" /></p>
<p>Here are the recommended filters:</p>
<ul>
<li>Date Range: All of my data</li>
<li>Format: HTML</li>
<li>Media Quality: Low (for a faster download; we are only interested in the text anyway)</li>
</ul>
<p>Click Create File. You will be notified when it is ready and prompted to download a zip file. Download and unzip it.</p>
<p>Open the unzipped folder, find the <code class="language-plaintext highlighter-rouge">posts</code> folder and locate the <code class="language-plaintext highlighter-rouge">your_posts_1.html</code> file.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_3.png" alt="posts_html" class="center" /></p>
<p>Now let’s do us some Python 😁</p>
<p>You need to install <a href="https://anaconda.org/anaconda/beautifulsoup4">BeautifulSoup</a> and <a href="https://anaconda.org/conda-forge/wordcloud">WordCloud</a>.</p>
<p>We now use BeautifulSoup to make sense of this hot fudge of HTML mess.</p>
<script src="https://gist.github.com/albertyumol/896ddf060e98727b4f7b46f745a1fa49.js"></script>
<p>To get the text of the posts I published, let us first check the source code using my favorite browser’s (Mozilla Firefox) inspector. Right click on a post and click Inspect Element (or whatever counterpart your browser has, but I say switch to Firefox, the most <a href="https://www.zdnet.com/article/germanys-cyber-security-agency-recommends-firefox-as-most-secure-browser/">secure browser</a>).</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_4.png" alt="inspect" class="center" /></p>
<p>Navigate along and do some pattern recognition to identify the correct class to filter on. My guess would be <code class="language-plaintext highlighter-rouge">_2pin</code>.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_5.png" alt="filter" class="center" /></p>
<p>So let’s find all the divs and filter for the class <code class="language-plaintext highlighter-rouge">_2pin</code>. Checking the contents confirms that <code class="language-plaintext highlighter-rouge">_2pin</code> captures all of the posts. The following code demonstrates how it’s done. I made a list <code class="language-plaintext highlighter-rouge">x</code> to contain all of the posts.</p>
<script src="https://gist.github.com/albertyumol/51b3ee04d9d0902299f1d7c253efc421.js"></script>
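<p>If you want to follow along without the gist, here is a minimal sketch of this step. The sample HTML string is an illustrative stand-in for the contents of <code class="language-plaintext highlighter-rouge">your_posts_1.html</code>; only the <code class="language-plaintext highlighter-rouge">_2pin</code> class comes from inspecting the actual export.</p>

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the contents of your_posts_1.html.
sample_html = """
<div class="pam"><div class="_2pin">First post: hello world!</div></div>
<div class="pam"><div class="_2pin">Second post: web scraping is fun.</div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every div with class _2pin into a list x.
x = [div.get_text() for div in soup.find_all("div", class_="_2pin")]
print(x)
```

<p>For the real file, parse <code class="language-plaintext highlighter-rouge">open('your_posts_1.html', encoding='utf-8').read()</code> instead of the sample string.</p>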
<p>We concatenate the list into one string, then visualize the words using Word Cloud.</p>
<script src="https://gist.github.com/albertyumol/ebc8f3716ebf97339b7edfeebe4b3a06.js"></script>
<p>You will notice that some words, like “https” and “PM”, are not really relevant. In Natural Language Processing (NLP) such filtered-out words are technically called <code class="language-plaintext highlighter-rouge">stop words</code>.</p>
<p>We remove them,</p>
<script src="https://gist.github.com/albertyumol/25d490c0c1577de95a6979e2e6daa1f3.js"></script>
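<p>Under the hood, a word cloud is just a word-frequency count, and removing stop words means dropping them before counting. Here is a stdlib-only sketch of that step; the stop-word list and sample text are illustrative (the <code class="language-plaintext highlighter-rouge">WordCloud</code> object accepts a similar set through its <code class="language-plaintext highlighter-rouge">stopwords</code> argument).</p>

```python
import re
from collections import Counter

text = "Check https example PM the the scraping scraping scraping post"

# Words we do not want in the cloud (illustrative list).
stop_words = {"https", "pm", "the", "a", "an", "and", "to", "of"}

# Tokenize, lowercase, and drop the stop words before counting.
words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
freq = Counter(words)
print(freq.most_common(3))  # 'scraping' dominates; 'https' and 'PM' are gone
```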
<p>And here is our final output:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/fb/fb_6.png" alt="output" class="center" /></p>
<p>In summary, we did a quick demonstration of how to download your data from Facebook, use BeautifulSoup to parse it, and create a word cloud from the scraped text. Scraping extends to many business and industry problems, e.g. data augmentation in competition and market research. Stay tuned for more tutorials on web scraping; next up is scraping from APIs.</p>
<p>References:</p>
<ul>
<li><a href="http://socialdata.site/chapter_04/">Social Data</a></li>
</ul>
<p>Questions? Contact me via <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>. I’m also on GitHub with the username <a href="https://github.com/albertyumol">albertyumol</a>.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>
<h2 id="corona">Making sense of the world with data</h2>
<p>Albert Yumol · 2020-03-30</p>
<p>Data science is a very powerful tool that can guide our understanding of real-world events. In times of crisis we may get confused and emotional; data science gives us a lens to be objective. It introduces concrete solutions to concrete material conditions. It is a science, after all: it follows sound logic, is repeatable, and yields insight.</p>
<p>This article gives you an idea of the data science process in conducting exploratory data analysis (EDA). The context is the nCov data from DOH (found here: <a href="https://docs.google.com/spreadsheets/d/16g_PUxKYMC0XjeEKF6FPUBq2-pFgmTkHoj5lbVrGLhE/edit?fbclid=IwAR1qRr3hTxSiQ8KdymZiIQfPX4CpSA4VezpNKqXIPCIMQI1H3xMTGJ16lMs#gid=0">COVID-19 Philippines</a>) and population data from PSA (found here: <a href="http://openstat.psa.gov.ph/PXWeb/pxweb/en/DB/DB__1A__PO/1001A6DTPR0.px/?rxid=7513be1c-0ada-4a03-909c-6f03e8b2d402&fbclid=IwAR1Vcfp-d-cfIF_ujsyDJICUCL6zPpTNS-51E5K8rFj373XzB_v7kbTllzE">Total population by age group, sex, region</a>).</p>
<p>This is all done in Python using a Jupyter Notebook with the basic tools of data wrangling, Pandas, Numpy, and Matplotlib.</p>
<p>Data from DOH is in a spreadsheet format, so when you import it, please make sure you have the proper Google APIs enabled.</p>
<p>We will not try to predict anything. We should leave that to the epidemiologists and experts who have been doing this for their entire careers. I will link some of their works below. I will also list the efforts in the Philippines using data science to help with the Corona Crisis.</p>
<h3 id="importing-data">Importing data</h3>
<p>First I used the libraries <code class="language-plaintext highlighter-rouge">gspread</code> and <code class="language-plaintext highlighter-rouge">oauth2client</code> to access the spreadsheet. You can also download the spreadsheet as a CSV and import it using the standard <code class="language-plaintext highlighter-rouge">pd.read_csv</code>. The advantage of my method is that every time I run the code it fetches the latest numbers from the spreadsheet, instead of my downloading the CSV every single time for an updated dataset.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/1.png" alt="1" class="center" /></p>
<p>Here are the columns indicated in the spreadsheet.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/2.png" alt="2" class="center" /></p>
<h3 id="asking-questions-of-the-data">Asking questions of the data</h3>
<p>Suppose we are interested in the running total of cases in the Philippines as of March 31, 2020. We can run:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/3.png" alt="3" class="center" /></p>
<p>Now, the goal of our EDA is to break down this number. Let us look first at the binned age distribution.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/4.png" alt="4" class="center" /></p>
<p>The average age of nCov-positive patients is around <code class="language-plaintext highlighter-rouge">54</code>. The distribution is nearly symmetric around the <code class="language-plaintext highlighter-rouge">60</code> age mark. This is interesting because the population of the Philippines is young: in fact, <a href="https://www.indexmundi.com/philippines/age_structure.html">89.35%</a> of Filipinos are aged 55 and below. This tells us that a significant number of older people get infected by the corona virus.</p>
<p>Now let’s look at gender,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/gender1.png" alt="gender" class="center" /></p>
<p>We see that around <code class="language-plaintext highlighter-rouge">62%</code> are male. We can only speculate about the reasons. Maybe males are more exposed due to the nature of their jobs, or maybe it can be attributed to sanitation and personal <a href="https://www.nst.com.my/world/world/2020/03/572170/men-worse-bathroom-hygiene-prevents-covid-19">hygiene</a>. We cannot claim this with certainty; we would need more factors and indicators to establish it.</p>
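<p>As a sketch of how these breakdowns are computed in Pandas (the column names and rows below are toy stand-ins for the DOH tracker fields, not the real data):</p>

```python
import pandas as pd

# Toy stand-in for the DOH line list (illustrative columns and values).
df = pd.DataFrame({
    "age":    [29, 54, 60, 71, 44, 65],
    "gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
})

print("total cases:", len(df))
print("mean age:", df["age"].mean())

# Share of each gender among confirmed cases.
print(df["gender"].value_counts(normalize=True))

# Binned age distribution, like the histogram of age groups.
print(pd.cut(df["age"], bins=[0, 20, 40, 60, 80]).value_counts().sort_index())
```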
<p>It’s easier to interpret if gender and age group are visualized together.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/combined_age_gender.png" alt="mix" class="center" /></p>
<p>Then again, people in statistics have a better way of doing this through a pyramid plot. In the case of the Philippines,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/pyramid_PH.png" alt="pyramid" class="center" /></p>
<p>The first three cases of the virus were Chinese nationals.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/china.png" alt="china" class="center" /></p>
<p>It might be interesting to see if other nationalities have been infected with the corona virus inside the PH as well.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/nation.png" alt="nation" class="center" /></p>
<p>Indeed, we see that that is the case. Although the majority of the cases are now Filipinos, foreign nationals have also tested positive inside the Philippines.</p>
<p>Given the insight that the corona crisis started from foreign sources, it is logical for us to ask,</p>
<h3 id="how-did-our-domestic-cases-acquired-corona">How did our domestic cases acquire Corona?</h3>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/travel.png" alt="travel" class="center" /></p>
<p>We see that around <code class="language-plaintext highlighter-rouge">72%</code> of those who tested positive had no recent relevant travel history. This confirms the experts’ statement that community transmission is already happening.</p>
<p>Interestingly, in the study of disease spread, epidemiologists always factor in links. You have a high likelihood of contracting the disease if you interacted with someone who has been infected and is a carrier of the virus. Plotting the patient links in our data set,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/link.png" alt="link" class="center" /></p>
<p>An alarming <code class="language-plaintext highlighter-rouge">97%</code> have no link to previous patients. This tells us that the current contact tracing is not efficient, as we are unable to monitor the chains of transmission. (Note: this number includes cases tagged as <code class="language-plaintext highlighter-rouge">for validation</code>, so it is definitely an overestimate. But given the weight of the matter, and with the clock ticking, an overestimate is better than an underestimate, since we are not yet able to confirm information in real time.)</p>
<p>Most of the locations of the patients are concentrated in Metro Manila.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/loc.png" alt="loc" class="center" /></p>
<p>But it is starting to trickle down to the provinces.</p>
<h3 id="evolutions-through-time">Evolutions through time</h3>
<p>On a more optimistic note, we see the count of new cases decreasing as of the week of this posting,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/new_cases.png" alt="new_cases" class="center" /></p>
<p>For us to see the pattern better, we can look at the cumulative count of cases,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/cummul1.png" alt="cummul1" class="center" /></p>
<p>This tells us that although the numbers are increasing, the rate at which they increase is losing momentum, which is a good indication.</p>
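<p>The cumulative curve is simply a running sum of the daily new-case counts, and the day-to-day ratio of cumulative totals shows whether growth is slowing. A minimal sketch with made-up daily numbers:</p>

```python
from itertools import accumulate

# Hypothetical daily new confirmed cases.
daily_new = [2, 5, 9, 16, 21, 14, 10]

# Running total, i.e. the cumulative case count per day.
cumulative = list(accumulate(daily_new))
print(cumulative)  # [2, 7, 16, 32, 53, 67, 77]

# Growth factor between consecutive days; a falling factor means lost momentum.
growth = [round(b / a, 2) for a, b in zip(cumulative, cumulative[1:])]
print(growth)
```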
<p>We can also look at the number of deaths and recoveries per day,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/rec_died.png" alt="rec died" class="center" /></p>
<p>Again, cumulatively,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/eda_corona/rec_died_cummul.png" alt="rec died cummul" class="center" /></p>
<p>Since the Philippines is still at the onset of its experience with the corona crisis relative to China and South Korea, the number of deaths per day is more significant than the number of recoveries per day. This is expected, as the same pattern was evident in the cases of China and South Korea. We should, however, expect the effect of the enhanced community quarantine to show in the numbers after one month.</p>
<p>Given this scenario, the focus of officials should already be on containing those who tested positive and empowering frontline workers by providing support and resources. We also see some evidence that the enhanced community quarantine has already reduced the number of cases in the PH. And we need to do more tests!</p>
<h3 id="whats-next">What’s next?</h3>
<p>We just imported and took a look at what’s in the PH COVID-19 data set using Python. That was a quick tour of the basic EDA process we follow in Data Science. I strongly recommend you peruse the documentation of the libraries we used, learn about all the other handy functions these tools provide, and explore further questions that you have for the data set.</p>
<p>You can also apply this process to other future projects that you plan to do.</p>
<h3 id="call-to-action">Call to Action</h3>
<p>If you want to help out or want to access some of the efforts using tech and data science to help in the corona crisis in the Philippines, check this out:</p>
<ul>
<li><a href="https://www.facebook.com/groups/1321659434692279">#LockdownLab</a></li>
</ul>
<blockquote>
A think tank for open source solutions and hacks to help each other survive the current pandemic Repository of resources and survival tips for COVID-19 lockdown
Like-minded friends only. Absolutely no trolling & wise-ass remarks.
Propose and share your projects and helpful resources. Support all your posts with photos, videos and/ links if possible.
</blockquote>
<ul>
<li><a href="https://www.facebook.com/groups/246781543171962">Let’s Help FRONTLINERS (COVID19 PH)</a></li>
</ul>
<blockquote>
Let's connect the FRONTLINERS to VOLUNTEERS and DONORS.
Please SHARE this group so we can send help to those in need.
</blockquote>
<ul>
<li><a href="https://bukasba.com/">Bukas Ba?</a></li>
</ul>
<blockquote>
Let’s help each other know which businesses are open* within our local communities during the quarantine. An establishment is open (bukas) if it is operational during the quarantine. This does not pertain to opening hours.
</blockquote>
<ul>
<li><a href="https://ncovph.com/">API For NCOVPH DATA</a> | <a href="https://ncovtracker.doh.gov.ph/">Dashboard</a></li>
</ul>
<blockquote>
With its corresponding dashboard from DOH.
</blockquote>
<ul>
<li><a href="https://www.mapcontrib.xyz/t/533a05-NCR_Quarantine_Essentials_Map">NCR Quarantine Essentials Map</a></li>
</ul>
<blockquote>
Produced by Grab Philippines to help citizens locate stores for food, essentials, and services.
</blockquote>
<ul>
<li><a href="https://saanyan.github.io/saanmaynagdedeliver/">Saan Yan</a></li>
</ul>
<blockquote>
Locating which establishments have delivery services.
</blockquote>
<ul>
<li><a href="https://www.facebook.com/hospitalbayanihan/">Hospital Needs Tracker</a></li>
</ul>
<blockquote>
List of hospitals and their calls for donations.
</blockquote>
<ul>
<li><a href="https://storefinder.ph">storefinder.ph</a></li>
</ul>
<blockquote>
Look for nearby stores, pharmacies, banks, and remittance centers that remain operational during the Covid-19 Enhanced Community Quarantine period.
</blockquote>
<ul>
<li><a href="https://ambag.me/">ambag.me</a></li>
</ul>
<blockquote>
Help out daily wage workers affected by the COVID-19 quarantine.
</blockquote>
<ul>
<li><a href="http://www.projectmoses.ph/apps">Project Moses PH</a></li>
</ul>
<blockquote>
Portal for latest information regarding COVID-19 in the Philippines
</blockquote>
<ul>
<li><a href="https://covid19phstatus.cp-union.com">COVID19 PH REPORT</a></li>
</ul>
<blockquote>
In response to the ongoing public health, transportation and food crises, the Computer Professionals' Union is opening to the public the COVID19 PH REPORT, a web-based geographical information system application. Anyone may report on community initiatives and the National Government's and LGUs' implementations of policies to address the aforementioned crises, as well as any incident related to the enhanced community quarantine.
We hope to gather as much information through this tool to sufficiently be able to provide a situationer which the general public can use to make informed decisions and actions in these trying times.
</blockquote>
<p>Questions? Contact me via <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>. I’m also on GitHub with the username <a href="https://github.com/albertyumol">albertyumol</a>.</p>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>
<h2 id="rng">The beautiful world of random numbers</h2>
<p>Albert Yumol · 2019-11-22</p>
<p>All of us have encountered chance in one way or another: that guilty moment when guessing in a multiple-choice exam, that heart-throbbing suspense when hoping to draw your crush’s name in the class Christmas gift exchange, that spin-the-bottle game of truth or dare when drinking with friends, or even that simple coin toss on whether to eat out or not. Chance is entangled with our lives, but ironically, most of us find it hard to grasp or understand [1].</p>
<p>First-time students of statistics and even novice teachers admit that statistics is a difficult field of study, as it goes against common intuition [2]. In real-life situations, most people feel uncomfortable when confronted with the concept of randomness. There is a tendency to believe that things happen for a reason and that there must be an explanation for the present state of things. These sentiments can be traced back to the earliest writings and theological teachings of almost all civilizations [3].</p>
<p>Operational mathematics, in its manner and form, is largely based on the axioms, theorems, and propositions of the ancient Greeks. Yet they did not pioneer a rigid mathematical treatment of the theory of randomness. First, they believed that the future is governed by the will of the gods. Second, they believed in absolute truth proved by logical deduction. And third, their number system was impractical for calculations with fractions and non-positive numbers. All of this implied that probabilities and chance were irrelevant, even an impossibility, to them [4].</p>
<p>The first advancement in the theory of randomness came much later, when the medical doctor turned gambler Gerolamo Cardano wrote a book titled ‘Book on Games of Chance’ that depicts strategies for card games, backgammon, and a primitive dice game played with astragali [4]. This work by Cardano became the basis of the works of Galileo Galilei (equiprobability of tossing three dice), Blaise Pascal (fundamental counting theory and the concept of mathematical expectation), and Jacob Bernoulli (the law of large numbers) [1].</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/cropped.png" alt="rng" class="center" /></p>
<p>Another notable contribution to the theory of randomness came from an English shopkeeper by the name of John Graunt. In his book ‘Natural and Political Observations Made Upon the Bills of Mortality’, he was able to accurately predict the mortality rates of London in a ‘normal’ year and also the number of casualties of a particular disease or incident [3].</p>
<p>From these insights into gambling and population dynamics grew our ‘confidence’ that even randomness follows certain rules. The theory of randomness piqued interest and became central to the study of thermodynamics and quantum theory.</p>
<p>At the time, early scientists drew balls from a ‘well-stirred’ urn for their random numbers. Eventually, tables of random numbers generated from census reports were published, but they were deemed inadequate for large sampling experiments [6]. Over the years, much of the development focused on producing random numbers at larger scales and faster rates through deterministic mathematical algorithms.</p>
<p>Humanity has been using random numbers since antiquity. In ancient Egypt, chance was used to divide property, delegate civic responsibilities or privileges, settle disputes among neighbors, devise strategies in battle, and drive the play in games [1]. Often randomness was used to ensure fairness, to prevent dissension, and to acquire divine direction.</p>
<p>Modern times have changed a lot, and random numbers are becoming ever more essential to our technologies. Many areas in physics require simulations of natural phenomena and physical systems, and for more realistic rendering, random numbers are often needed. In the field of statistics, random sampling is often necessary to obtain a generalization of the population.</p>
<p>Exhaustive numerical problems often require large amounts of random numbers to crunch and brute-force all possible solutions. One such application is the Monte Carlo method, which consists of simulation techniques that use large amounts of random numbers [7]. This method has vast applications in calculating definite integrals, systems of equations, and mathematical models in higher dimensions.</p>
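<p>The classic toy example is estimating \(\pi\): scatter random points in the unit square and count the fraction that lands inside the quarter circle, whose area is \(\pi/4\). A minimal sketch:</p>

```python
import random

def estimate_pi(n_samples, seed=42):
    """Estimate pi by uniform random sampling in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point lies inside the quarter circle
            inside += 1
    # The fraction inside approximates pi/4, so multiply by 4.
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.1416; accuracy grows with n_samples
```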
<p>Current high-definition movies and video games rely mostly on the realistic rendering of characters and scenes. Random numbers are often used to identify parameter values in which the rendering will fit observations in natural systems. One example is the Disney movie ‘Frozen’. The producers based the simulations of the snow dynamics on a hybrid Eulerian/Lagrangian Material Point Method in which simulation parameters like the number of snow particles in a grid were chosen at random.</p>
<p>Complex systems in economics, sociology, and politics often need strategic analysis for decision making. These systems and their analysis can be understood better using ‘game theory’, a family of models in which a game is any competitive activity where players contend with each other according to a set of rules [8]. Examples include political candidates vying for votes, court judges deciding on a case, animals fighting over prey, etc. Random numbers are central to the simulations involved in game theory, for it is necessary to exhaust all possible outcomes of a game to optimize the decision process.</p>
<p>Computer algorithms often require random numbers to test their robustness. Of particular interest is the application to cryptography, which aims to ensure the integrity and security of data. Branching out from cryptography, modern ideas on innovating monetary and banking systems are starting to take shape. Cryptocurrency has become of particular interest for decentralizing banks and other financial institutions. The process involves person-to-person transactions recorded in a public ledger monitored by individuals called ‘miners’ [9]. As a result, transaction fees and product prices become lower, and since the currency transcends national boundaries, it can be used everywhere. The implementation of cryptocurrencies also raises environmental and sustainability considerations, as it would reduce our need for paper bills and metallic coins.</p>
<p>For a start, there is no such thing as a random number [6]. Rather, we think of random numbers as a ‘sequence’, not just a single number; whether a number belongs to a random sequence depends entirely on the previous and the next numbers. Generating random numbers is no easy task [10]. Generally, random number generators are classified based on how their numbers are generated. The types include Pseudo Random Number Generators (PRNGs), Cryptographically Secure Random Number Generators (CSRNGs), and True Random Number Generators (TRNGs).</p>
<p>If a random number generator looks random and passes many standardized statistical tests of randomness, then it can be classified as a PRNG. This type of generator relies on mathematical algorithms to generate random numbers. The numbers produced possess the same statistical properties as ‘real’ random numbers. PRNGs are deterministic, as they are governed by a particular function, and also periodic, since they use bounded memory. Applications that use PRNGs almost always require algorithms that produce random numbers with long periods; some even use periods of \(2^{256}\) or higher [11]. PRNGs are useful for simulations but definitely not for cryptographic applications [12].</p>
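<p>Both the determinism and the periodicity are easy to see in a toy linear congruential generator. The parameters below are deliberately tiny so that the period (16) is visible; real PRNGs use vastly larger state.</p>

```python
def lcg(seed, a=5, c=3, m=16):
    """Tiny linear congruential generator: x_next = (a*x + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=7)
sequence = [next(gen) for _ in range(20)]
print(sequence)

# Deterministic: re-seeding reproduces the exact same stream.
gen2 = lcg(seed=7)
assert sequence == [next(gen2) for _ in range(20)]

# Periodic: with m = 16 the sequence repeats after 16 draws.
assert sequence[:4] == sequence[16:20]
```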
<p>If a random number generator produces numbers that look random and are unpredictable, then it is classified as a CSRNG. Basically, it is like a PRNG in that it uses a deterministic algorithm to generate numbers, but it does not leak any information about the next number [11]. Examples include the CSPRNGs of Unix-based operating systems (OS) exposed as the device files <strong>/dev/random</strong> and <strong>/dev/urandom</strong>. These are system files where the operating system pools random OS-dependent events such as the system clock, the elapsed time between keystrokes or mouse movements, and other physical noise sources [10].</p>
<p>Two operations are used in seeding these CSRNGs. In <em>/dev/random</em>, the operation is called ‘blocking’: the process blocks an application until sufficient entropy is reached. Entropy in this system is defined as the random bits stored in a file; the pool of bits, called the entropy pool, is extracted from physical processes available to the software, like mouse movements and keyboard inputs. On the other hand, <em>/dev/urandom</em> makes do with whatever is available in the OS’s entropy pool without blocking [13].</p>
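<p>In Python you can tap the same kernel source without opening the device files directly: <code class="language-plaintext highlighter-rouge">os.urandom</code> and the <code class="language-plaintext highlighter-rouge">secrets</code> module both read from the operating system's CSPRNG (on Linux, the pool behind /dev/urandom). A short sketch:</p>

```python
import os
import secrets

# 16 bytes straight from the kernel CSPRNG (non-blocking, like /dev/urandom).
raw = os.urandom(16)
print(raw.hex())

# The secrets module wraps the same source with convenience helpers.
token = secrets.token_hex(16)  # 32 hex characters = 16 random bytes
n = secrets.randbelow(100)     # uniform integer in [0, 100)
print(token, n)
```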
<p>The last type of random number generator is the TRNG, which looks random, is unpredictable, and cannot be reliably reproduced. These generators are mainly based on the digitization of physical processes. Such physical processes and sources include: (1) the elapsed time between emissions of particles during radioactive decay; (2) thermal noise from a semiconductor diode or resistor; (3) the frequency instability of a free-running oscillator; (4) the amount a metal-insulator-semiconductor capacitor is charged during a fixed period of time; (5) air turbulence within a sealed disk drive, which causes random fluctuations in disk drive sector read latency times; and (6) sound from a microphone or video input from a camera.</p>
<p>Generally, TRNGs are the best generators for applications in security and encryption, as they are non-deterministic [12]. One point of concern with TRNGs is that they are inherently biased. These biases are due to our inability to devise perfectly balanced physical processes, but they can be removed by post-processing algorithms that balance out the bits.</p>
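<p>The classic post-processing algorithm is von Neumann debiasing: read the raw bits in pairs, output 0 for a 01 pair and 1 for a 10 pair, and discard 00 and 11 pairs. For any biased but independent source this yields unbiased output bits, at the cost of discarding most of the input. A sketch:</p>

```python
def von_neumann(bits):
    """Extract unbiased bits from biased independent ones (von Neumann, 1951)."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:          # keep only disagreeing pairs
            out.append(a)   # pair (0, 1) -> 0 and pair (1, 0) -> 1
    return out

# A heavily biased source: mostly ones.
biased = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
print(von_neumann(biased))  # [1, 0, 1]
```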
<p>There are also free services on the web that offer finite sequences of random numbers. Two particularly useful websites are random.org, which obtains its data from atmospheric noise [14], and HotBits, which uses the radioactive decay of Cesium atoms [15].</p>
<p>Random numbers generated from physical processes are vital to our current mode of communication, which relies on the transmission of data through computer systems and networks. Our daily activities like sending emails, browsing social media, online banking, and shopping depend on these telecommunication platforms and a world wide web secured by random numbers.</p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<p>References:</p>
<p>[1] D. Bennett. Randomness. 1998. Harvard University Press.</p>
<p>[2] J. Piaget and B. Inhelder. The Origin of the Idea of Chance in Children. 1975. W. W. Norton</p>
<p>[3] G. Markowsky. The Sad History of Random Bits. Journal of Cyber Security. 2014. River Publishers</p>
<p>[4] L. Mlodinow. The Drunkard’s Walk: How Randomness Rules Our Lives. 2008. Pantheon</p>
<p>[5] J. D. Norton. <a href="http://www.pitt.edu/~pittcntr/Being_here/last_donut/donut_2014-15/02-27-15_dice.html">Dice</a></p>
<p>[6] D.E. Knuth. Art of Computer Programming, Volume 2: Seminumerical
Algorithms. 1968. Addison-Wesley Publishing Company, Inc.</p>
<p>[7] J. Gentle. Random Number Generation and Monte Carlo Methods. 2002. Springer-Verlag New York</p>
<p>[8] M. J. Osborne. An Introduction to Game Theory. 2003. Oxford University Press</p>
<p>[9] Bitcoin <a href="http://www.bitcoin.org">bitcoin.org</a></p>
<p>[10] A. Menezes P. van Oorschot and S. Vanstone. Handbook of Applied Cryptography. 1997. CRC Press</p>
<p>[11] B. Schneier. Applied Cryptography, Second Edition: Protocols, Algorthms, and Source Code in C. 1994. Wiley</p>
<p>[12] A. McAndrew. Introduction to Cryptography with Open-Source Software. 2011. CRC Press</p>
<p>[13] M. Sheth. <a href="https://www.veracode.com/blog/research/cryptographically-secure-pseudo-random-number-generator-csprng">Cryptographically Secure Pseudo-Random Number Generator (CSPRNG)</a></p>
<p>[14] <a href="https://www.random.org/">random.org</a></p>
<p>[15] J. Walker. <a href="https://www.fourmilab.ch/hotbits/">HotBits: Genuine random numbers, generated by radioactive decay</a></p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolRandom numbers generated from physical processes are vital our current mode of communication that relies on the transmission of data through computer systems and networks. Our daily activities like sending emails, browsing social media, online banking and shopping depend on these telecommunication platforms and the world wide web secured by random numbers.[Developing] Lecture Notes on Time Series Analysis2019-11-03T00:00:00+00:002019-11-03T00:00:00+00:00https://albertyumol.github.io//tsa<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>Note: This is assuming the reader already knows basic statistics.</p>
<p><strong>Time Series</strong> is a collection of observations made sequentially through time.</p>
<p><strong>Definition of Terms</strong></p>
<p><strong>continuous time series</strong>
are time series whose observations are made continuously through time. This type can also be used when the measured variable can only take a discrete set of values.</p>
<p><strong>discrete time series</strong>
are time series whose observations are taken only at specific times, usually equally spaced. This type can also be used when the measured variable is a continuous variable.</p>
<p><strong>point process</strong>
are series of events occurring ‘randomly’ through time.</p>
<p><strong>sampled time series</strong>
are digitized version of continuous time series sampled at equal intervals of time to give a discrete time series.</p>
<p>When successive observations are dependent, future values may be predicted from past observations.</p>
<p><strong>deterministic time series</strong>
are time series that can be <em>predicted exactly</em> from previous observations.</p>
<p><strong>stochastic time series</strong>
are time series that can be <em>predicted only partly</em> from previous observations. This entails thinking of future values as having a probability distribution.</p>
<p><strong>outliers</strong>
observations that do not appear to be consistent with the rest of the data (they may be valid, or merely freak observations from a data acquisition mishap).</p>
<p><strong>robust</strong>
insensitivity to outliers.</p>
<p><strong>objective of time series analysis</strong></p>
<ul>
<li>description
<ul>
<li><strong>Time plot</strong> observations against time. First step in analyzing a time series to obtain a simple descriptive measures of the main properties of the series.</li>
</ul>
</li>
<li>explanation
<ul>
<li><strong>Linear system</strong> converts an input series into an output series by a linear operation.</li>
</ul>
</li>
<li>prediction
<ul>
<li>used to describe SUBJECTIVE methods in inferring possible future values</li>
<li><em>forecasting</em>, on the other hand, is used to describe OBJECTIVE methods.</li>
</ul>
</li>
<li>control
<ul>
<li>improve or govern some physical or economic system, such as keeping a power plant process operating at a high level, or working out an optimal control strategy through statistical modelling.</li>
</ul>
</li>
</ul>
<p><strong>Usual Techniques used</strong></p>
<ul>
<li>Simple Descriptive techniques</li>
<li>Autocorrelation</li>
<li>Analysis in the time domain</li>
<li>Spectral Density Function</li>
<li>Linear Systems</li>
<li>State-Space Models</li>
<li>Kalman Filter</li>
<li>Nonlinear and Multivariate Time Series Models</li>
</ul>
<p><strong>Contents</strong></p>
<ul>
<li>Simple Descriptive Techniques</li>
<li>Some Time Series Models</li>
<li>Fitting Time Series Models in Time Domain</li>
<li>Forecasting</li>
<li>Stationary processes in the Frequency Domain</li>
<li>Spectral Analysis</li>
<li>Bivariate Processes</li>
<li>Linear Systems</li>
<li>State-Space Models and the Kalman Filter</li>
<li>Nonlinear Models</li>
<li>Multivariate Time Series Modelling</li>
<li>Fourier, Laplace and z-Transforms</li>
<li>My Beloved Dirac delta &lt;3</li>
</ul>
<p><strong>Simple Descriptive techniques</strong>
Typical surface-level statistical methods for analyzing data include computing the mean, median, or mode and the standard deviation to quantify the location and dispersion of a data set. <em>Time series analysis is different</em>.</p>
<p>The previously mentioned summary statistics can be misleading when the time series contains trend, seasonality, inherent systematic components, and correlations to observables.</p>
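<p>To see why, here is a small numpy illustration with a synthetic trending series (the numbers are made up):</p>

```python
import numpy as np

# A series with a strong upward trend plus small noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.5 * t + rng.normal(0, 1, size=200)

# The overall mean describes no particular epoch of the series:
overall = series.mean()
early = series[:50].mean()    # well below the overall mean
late = series[-50:].mean()    # well above it
print(overall, early, late)
```

<p>The overall mean sits at a level the series merely passes through, so as a ‘typical value’ it describes nothing; the trend has to be modelled or removed first.</p>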
<p><strong>Types of Variations</strong></p>
<ul>
<li>Seasonal Variations</li>
<li>Other Cyclic Variations</li>
<li>Trend</li>
<li>Other Irregular Fluctuations</li>
</ul>
<p><strong>Types of Variations</strong></p>
<ul>
<li>Stationary Time Series</li>
</ul>
<p><strong>The Time Plot</strong></p>
<p><strong>Transformations</strong></p>
<ul>
<li>To stabilize the variance.</li>
<li>To make the seasonal effect additive.</li>
<li>To make the data normally distributed.</li>
</ul>
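<p>For instance, when the seasonal effect is multiplicative (observation = trend × seasonal factor), taking logarithms makes it additive. A sketch with a made-up series:</p>

```python
import numpy as np

t = np.arange(48)
trend = np.exp(0.05 * t)                        # growing trend
season = 1 + 0.3 * np.sin(2 * np.pi * t / 12)   # 12-step cycle, always > 0
x = trend * season                              # multiplicative model

# log turns the product into a sum of a trend term and a seasonal term:
log_x = np.log(x)
print(np.allclose(log_x, np.log(trend) + np.log(season)))  # True
```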
<p>to be continued :)</p>
<p>Coming up: ARIMA principles, derivations, and applications.</p>
<p>Want to collaborate? Message me in <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolUpskilling with An Introduction to The Analysis of Time Series by Chris Chatfield.From Paint to PhotoShop real quick: Generating photo-realistic images using semantic image synthesis2019-11-03T00:00:00+00:002019-11-03T00:00:00+00:00https://albertyumol.github.io//ip<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>Have you heard the latest buzz from NVIDIA? Their engineers were able to create an algorithm that converts semantic images, the ones that you doodle in MS Paint as a toddler, into very realistic images.</p>
<p>Just look at this animation:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/ip/gaugan.gif" alt="Gaugan." class="center" /></p>
<p>I was not born with artistically coordinated hands. I can’t even draw passable stick figures. My right-hand sketches are practically as bad as my left. Having a physics background only worsened this ‘skill’ as I was trained to assume and oversimplify 3D moving objects as 1D dots and particles connected with hasty crooked arrows. I find art fun but I guess it’s not mutual.</p>
<p>Now, with the advent of deep learning and machine learning in image processing, we can all be the Van Goghs and Picassos that we dreamed of.</p>
<p>To start, here is a <a href="https://arxiv.org/abs/1903.07291">link</a> of their paper if you don’t want any spoilers.</p>
<p>This paper is state-of-the-art in image segmentation and style transfer as of writing so brace yourselves with some maths and ramp up your <em>machine-learning curves</em>.</p>
<p><strong>Image Processing Basics</strong></p>
<p>There are two main things that you can do with image processing: <em>classification</em> and <em>generation</em>. Image classification involves various algorithms that categorize an existing image into a particular class.</p>
<p>Like this Meme:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/ip/meme.jpg" alt="meme" class="center" /></p>
<p>Visually an image classifier algorithm looks like this:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/ip/image_classifier.jpg" alt="meme" class="center" /></p>
<p>Basically, you have an image with <strong>3 channels</strong> (red, green and blue) and a height-by-width dimension. You apply image decomposition techniques and output a value that you use to classify the image (cat or dog, as in the meme above).</p>
<p>On the other hand, image generation is when you create new images from certain inputs. This time you use that small amount of information to reconstruct an image or generate an entirely new one.</p>
<p>The main imaging technique behind both of these processes is called convolution. In simplest terms, convolution in image space is just multiplication in frequency space. Read my <a href="https://albertyumol.wixsite.com/bash/activity-4">post</a> on Fourier Transforms, Convolution, and Image Formation.</p>
<p>Another term for an image classifier is a <em>discriminator</em>. A convolution operation reduces the dimension of features based on a pixel window:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/ip/convolution.gif" alt="meme" class="center" /></p>
<p>Image generation works with the same principle but done in reverse.</p>
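<p>A naive sketch of that sliding-window operation in Python (strictly this is cross-correlation, which is what most deep learning libraries implement under the name convolution; the image and kernel are toy values):</p>

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' mode, stride 1: slide a k x k window over the image and
    sum the element-wise products, shrinking H x W to (H-k+1) x (W-k+1)."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
box = np.ones((3, 3)) / 9.0            # 3x3 averaging kernel
result = conv2d_valid(image, box)
print(result.shape)                    # (3, 3) -- the dimension is reduced
```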
<p>As a baseline to measure the performance, an image classifier can be cross-validated with a ground truth. The problem is the baseline for testing the performance of an image generator. This problem is addressed by a type of network algorithm called <em>Generative Adversarial Network</em> (GAN).</p>
<p>Adversarial, because you train two models that compete with each other. The generator produces fake images that are fed as inputs to the discriminator. The discriminator then tries to identify whether an input image is real or fake.</p>
<p>To train the models and make them learn adaptively, we use a loss function given by:</p>
\[\begin{equation}
\mathcal{L}_\text{GAN}(D, G) = \underbrace{E_{\vec{x} \sim p_\text{data}}[\log D(\vec{x})]}_{\text{accuracy on real images}} + \underbrace{E_{\vec{z} \sim \mathcal{N}}[\log (1 - D(G(\vec{z})))]}_{\text{accuracy on fakes}}
\end{equation}\]
<p>Consider \(x\) the input image. If the discriminator \(D\) is accurate in identifying real images, \(D(\vec{x})\) is close to 1 and the first term on the right-hand side approaches its maximum value of 0. Likewise, if \(D\) correctly identifies fake images, \(D(G(\vec{z}))\) is close to 0 and the expectation in the last term also approaches 0.</p>
<p>Training is a minimax game: the discriminator tries to maximize this objective while the generator tries to minimize it, and the competence of both models increases as they compete.</p>
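<p>To get a feel for this objective, here is a small numerical sketch; the discriminator scores are made-up illustrative values:</p>

```python
import numpy as np

def gan_loss(d_real, d_fake):
    """L_GAN = E[log D(x)] + E[log(1 - D(G(z)))], averaged over a batch."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A sharp discriminator: real images scored near 1, fakes near 0.
sharp = gan_loss(np.array([0.95, 0.99]), np.array([0.02, 0.05]))
# A confused discriminator: everything scored 0.5.
confused = gan_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(sharp, confused)   # the sharp discriminator scores higher
```

<p>Both terms peak at 0 when the discriminator is perfect, which is why the discriminator pushes this quantity up while the generator works against it.</p>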
<p>GANs are amazing at this type of image processing. However, they are sensitive to noise and thus very hard to train. For example, if \(D\) almost always misclassifies fake images as real, the generator tends to keep producing a single lossy, unrealistic image. Experts call this <em>mode collapse</em>.</p>
<p>To address this dilemma, previous studies have focused on modifying the loss function and gradually increasing the image resolution at each training step.</p>
<p>GANs can also be extended to image-to-image translation. Instead of a random input vector, the generator can take an image as input, say a semantic image, and generate a synthetic image from it. The discriminator then compares the semantic image and the original image and decides whether the result is real or not.</p>
<p>The problem with this GAN setup is that the convolution operations across training steps and layers combine only patches that are small relative to the usual training images, which still makes training slow and unreliable.</p>
<p>One relatively recent method to address this is the <strong>pix2pixHD</strong> algorithm. It utilizes skip connections by training portions of the network on lower-resolution images, appending them to other layers, and training again on higher-resolution images.</p>
<p>Going back to the previous GANs, most of them use <strong>batch normalization</strong> to boost speed and training stability. This is done over a batch of images through this equation:</p>
\[y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta\]
<p>Essentially, the mean of each feature is translated to 0 and the standard deviation to 1. However, this method is still costly because each layer of the network must keep adapting to the new parameter values.</p>
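<p>A minimal numpy sketch of this normalization (the batch size, feature count, and the \(\gamma = 1\), \(\beta = 0\) defaults are illustrative):</p>

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta,
    with the statistics taken over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(1)
batch = rng.normal(5.0, 3.0, size=(64, 8))  # 64 samples, 8 features
y = batch_norm(batch)
print(y.mean(axis=0))   # ~0 for every feature
print(y.std(axis=0))    # ~1 for every feature
```

<p>Instance normalization, used by pix2pixHD, applies the same formula but computes the statistics per image rather than per batch.</p>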
<p>The pix2pixHD resolves this by implementing <strong>instance normalization</strong> which normalizes over each image separately reducing training time.</p>
<p>pix2pixHD performs well, except when it doesn’t. Case in point: when there is only a single class in the semantic image. Since instance normalization is applied with identical values in each channel, the convolution operation flattens out the features, ultimately throwing away information from the semantic image.</p>
<p>This is where SPADE comes in. SPADE stands for spatially-adaptive (de)normalization. It augments the pix2pix framework by preventing the loss of semantic information. This is done by using downsampled versions of the semantic image to modulate the normalized outputs of each training layer.</p>
<p>Now we can revert to the original GAN implementation with only the random vector as input. Since the semantic image is integrated into the network itself, we can feed in as many images as we like.</p>
<p>Lastly, if we have another input image that we can train on, pass through an encoder, and turn into generator vectors, SPADE will also reproduce an image copying the style of the encoded input image.</p>
<p>Here are some results:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/ip/spade_style_transfer.jpg" alt="meme" class="center" /></p>
<p>Just imagine the possibilities that we can do with this synthetic image generation. Maybe we can even create our own high-definition movies from rough sketches come the future.</p>
<p>Want to collaborate? Message me in <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<p>References:</p>
<p>[1] T. Park, et al. <a href="https://arxiv.org/abs/1903.07291">Semantic Image Synthesis with Spatially Adaptive Normalization.</a></p>
<p>[2] A. King. <a href="https://adamdking.com/blog/gaugan/">Photos from Crude Sketches: NVIDIA’s GauGAN Explained Visually.</a></p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolStyle transfer and realistic image rendering using Neural Networks.[Developing] Basic Statistics Review for Data Science2019-10-13T00:00:00+00:002019-10-13T00:00:00+00:00https://albertyumol.github.io//stats<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>This is a draft of a <em>basic statistics module for data scientists</em> developed for <strong>Eskwelabs</strong>. The module assumes that the reader knows some basic Python and survival algebra.</p>
<hr />
<p><b> Outline </b></p>
<ul>
<li>Measures of Central Tendencies</li>
<li>Measures of Dispersion</li>
<li>Covariance, Correlation, and Causation</li>
<li>Probability
<ul>
<li>Dependence, Independence, and Conditional Probability</li>
<li>Bayes’ Theorem</li>
<li>Random Variables</li>
<li>Continuous distributions</li>
<li>Normal distribution</li>
<li>Central Limit Theorem</li>
</ul>
</li>
<li>Hypothesis testing and inference
<ul>
<li>Confidence interval</li>
<li>Bayesian inference</li>
</ul>
</li>
</ul>
<p>In the recent decade, we have produced the majority of our data, thanks to increases in the computing power of our machines and the volume of our storage systems, and the decreasing cost of producing technology. These have made technology accessible to many more people.</p>
<p>But data is just that - data. It doesn’t really tell us much. To get insights from it, we need to process it.</p>
<p>Take as an example the qualifying entrance exam for <strong>Eskwelabs</strong>. Assume that there are 50 items for the exam with 100 examinees. You have already learned by now how to generate synthetic data in Python. Now we generate the scores randomly and visualize using a bar plot hoping that no one scores below 50.</p>
<p>Here is how you do it in Python:</p>
<script src="https://gist.github.com/albertyumol/95649098fa8c08dd58ce5e966b57b886.js"></script>
<p>And here is the resulting distribution:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/stat/eskwelabs_score.png" alt="Eskwelabs scores." class="center" /></p>
<p>Visualizing gives us a feel for what the data looks like. We can see that one student actually perfected the exam (<strong>maximum</strong>) and that the lowest score is around 25 (<strong>minimum</strong>). From these we can tell that the scores <strong>range</strong> from 25 to 50.</p>
<p>Most of the time, visualizing data is not enough to get insights. That is where statistics comes in.</p>
<p>Statistics refers to the mathematical methods and techniques used to gain insights from data. I know you already did some statistics before, so this should be easy breezy.</p>
<p>The three most common tools used in data description are what we call the measures of central tendency.</p>
<h1>Measures of Central Tendencies</h1>
<p>In exploring our data sets, we want to check the average or central values of our data. Centrality can be measured by these 3 values: mean, median, and mode.</p>
<p><strong>Mean</strong> <br /></p>
<p>The mean gives you an initial summary of your data set. It is obtained by getting the sum of the individual data points \(x_{i}\) and divide this sum by the total number of data points \(N\).</p>
<p>Mathematically:</p>
\[\begin{equation}
\overline x = \sum_{i}^{N} \frac {x_i}{N}
\end{equation}\]
<p>Programmatically in Python:</p>
<script src="https://gist.github.com/albertyumol/a532da3d311d913247111c2485096231.js"></script>
<p>Calculating the mean exam score of Eskwelabs applicants, we find that the average is \(36.9\), which is \(73.8\%\). Not bad if the exam is difficult. This value gives us a rough idea of how the examinees performed overall.</p>
<p>Another measure of central tendency is the median.</p>
<p><strong>Median</strong> <br /></p>
<p>The median is the middle value of the data set when you sort out every single points from lowest to highest. The most common approach in obtaining the median value is by considering two cases: when the length of data set \(N\) is odd or even.</p>
<p>When \(N\) is odd, it is not divisible by two thus we obtain a unique middle value.</p>
<p>Mathematically:</p>
\[\begin{equation}
M_{odd} = \left( \frac {N+1}{2} \right) ^{th} term
\end{equation}\]
<p>The other case is when \(N\) is even which implies that there are two candidate values for the median. What you do is get the average of these two points.</p>
<p>Mathematically:</p>
\[\begin{equation}
M_{even} = \frac {\left( \frac {N}{2} \right) ^{th} term + \left( \frac {N}{2} + 1 \right) ^{th} term}{2}
\end{equation}\]
<p>Implementing both these equations in Python:</p>
<script src="https://gist.github.com/albertyumol/15011622fbd3747ae541f45b5814b002.js"></script>
<p>If you are keen enough to notice, in our mathematical equation for the even case we add 1 to the middle value but when we implemented in code we change it by subtracting one. This is because in Python, we start counting from 0 so we need to account for this translation.</p>
<p>The median score for the Eskwelabs exam data set is 37 which is not far from the mean. As we have observed, the median is a bit more complex to calculate because you need to sort first and find the middle value. There are other approaches for median calculation without the tedious sorting. See this link for exploration: <a href="https://medium.com/@nxtchg/calculating-median-without-sorting-eaa639cedb9f">Sorting Less Median Calculation</a>.</p>
<p>Although the mean is easier and faster to calculate, it is very sensitive to outliers unlike the median. So if your analysis requires less sensitivity to noise, median is the better descriptor.</p>
<p>As a side note, the median is a special case of a measurement called a <strong>quantile</strong>. Quantiles partition the data set into equal proportions; the median is a second-order quantile because it divides the data set into two. Here is the general code to calculate a quantile based on its order (the number of partitions):</p>
<script src="https://gist.github.com/albertyumol/cb1434ebab9db42a6d4cdab472d2f43e.js"></script>
<p>The last measure of central tendency is the mode.</p>
<p><strong>Mode</strong> <br /></p>
<p>The mode is a measure of frequency. It gives you the most frequent value in your data set. You can use this to check the balance of your data set if it is biased for particular values. It can have more than one value if there is a tie for the maximum number of counts. The mode requires no formula since it is just the most frequent data point. One cool approximation for the mode by Professor Karl Pearson is given by the empirical formula:</p>
\[Mode = 3(Median) - 2(Mean)\]
<script src="https://gist.github.com/albertyumol/2d58c12acc98fcc7c5242c05377805c8.js"></script>
<p>The mode for the Eskwelabs exam problem can be computed the same way.</p>
<h1>Measures of Dispersion</h1>
<p>Dispersion is the measure of the spread. It tells you the difference in values across all data points. A small dispersion value indicates the data points are near each other while a large dispersion means data points are far apart.</p>
<p>Basic dispersion measures include the range, variance, standard deviation, and interquartile range.</p>
<p>The range is self-explanatory: it gives you the upper and lower bounds of your data set. Mathematically:</p>
\[Range = x_{max} - x_{min}\]
<p>In Python,</p>
<script src="https://gist.github.com/albertyumol/dc373d15b5c8f4105e7b6a4b411c08f3.js"></script>
<p>The problem with the range is that it only describes the end points of the data set; it provides no insight into the data in between. To address this, we consider the next metric, called the variance.</p>
<p>Variance measures how far the data points spread from the mean. Mathematically:</p>
\[\sigma ^{2} = \sum_{i} \frac {(x_i - \overline x)^{2}}{N}\]
<p>where \(x_i\) is a single data point, \(\overline x\) is the mean, and \(N\) the total number of data points.</p>
<p>To do it in Python,</p>
<script src="https://gist.github.com/albertyumol/97e6a6db5b2306aa5e56494a5bd2ee9f.js"></script>
<p>The variance helps us determine the size of the data spread, but there is a problem with units: the measures of central tendency and the range all share the units of the data, while the variance is in squared units. It makes more sense to take the square root of the variance (restoring the original units), which we call the standard deviation, given by</p>
\[\sigma = \sqrt {\sigma ^{2}} = \sqrt {\sum_{i} \frac {(x_i - \overline x)^{2}}{N}}\]
<p>In Python,</p>
<script src="https://gist.github.com/albertyumol/88ea806ad551a42562a80bb159db6ce2.js"></script>
<p>The standard deviation measures the absolute variability of the dispersion with respect to the mean. However, like the range, it is very susceptible to outliers. Depending on the application, a better metric is the interquartile range, calculated as the difference between the \(75^{th}\) and \(25^{th}\) percentile values. In Python,</p>
<script src="https://gist.github.com/albertyumol/775f04f1fc5fc71b0d8009bbffd0396f.js"></script>
<h1>Covariance, Correlation, and Causation</h1>
<p>Statistics also has some philosophical roots. We often hear the phrase <em>‘correlation is not causation’</em>. Take correlation with a grain of salt: you may be misled by simplified assumptions and confounding variables.</p>
<p>For example, if there exists a strong correlation between variables x and y, any of the following could be true:</p>
<ul>
<li>x causes y</li>
<li>y causes x</li>
<li>both causes each other</li>
<li>neither causes each other</li>
</ul>
<p>Recall that the variance measures how a single variable deviates from its mean; the <strong>covariance</strong> measures the variation of two variables with respect to each other and their means. Mathematically:</p>
\[\sigma_{xy} = \frac {1}{N} \sum_{i} (x_i - \overline x)(y_i - \overline y)\]
<p>In Python,</p>
<script src="https://gist.github.com/albertyumol/a135760cb2f1f4507045e49c4b67470a.js"></script>
<p>The sum in the equation above is a dot product of the two mean-centered variables. A significantly positive covariance indicates that x is large when y is large, and vice versa. A significantly negative covariance indicates that x is large when y is small, and vice versa. A covariance near zero indicates no linear relation between the two variables.</p>
<p>We are now faced with the problem of interpreting the covariance. It is typically difficult to judge how big a number must be to be significant relative to other values. Another problem is the units: the covariance carries the units of x multiplied by the units of y. That is why we transform the covariance by normalizing it with the standard deviations of both variables \(x\) and \(y\). We call this transformed metric the correlation, \(\rho_{xy}\), defined mathematically as:</p>
\[\rho_{xy} = \frac {\sigma_{xy}}{\sigma_{x} \sigma_{y}}\]
<p>In Python,</p>
<script src="https://gist.github.com/albertyumol/6310d210258e196a12ea8a2642e419d8.js"></script>
<p>This value is unitless (since we normalized) and ranges from [-1, 1]. We interpret values as:</p>
<ul>
<li>
<p>1 : perfect correlation</p>
</li>
<li>
<p>0 : no correlation</p>
</li>
<li>
<p>-1 : perfect anti-correlation</p>
</li>
</ul>
<p>Exercises:</p>
<p>(1) Generate two random samples: x ~ N(1.78,0.1) and y ~ N(1.66,0.1).</p>
<p>(2) Compute \(\overline x\), \(\sigma_{x}\), \(\sigma_{y}\), \(\sigma_{xy}\), \(\rho_{xy}\) from scratch (without using any module e.g. numpy, statsmodels). Compare the results with the ones from the numpy library and state your observations.</p>
<p>For further reading:</p>
<p>Python related:</p>
<ul>
<li><a href="https://www.scipy.org/">SciPy</a></li>
<li><a href="https://pandas.pydata.org/">pandas</a></li>
<li><a href="https://www.statsmodels.org/">StatsModels</a></li>
</ul>
<p>Statistics Books and references:</p>
<ul>
<li><a href="https://www.openintro.org/stat/textbook.php">OpenIntro Statistics</a></li>
<li><a href="https://openstax.org/details/introductory-statistics">OpenStax Introductory Statistics</a></li>
</ul>
<h1>Probability</h1>
<p>Probably one of the hardest concepts to grasp in statistics is probability. As humans, we feel discomfort when confronted with uncertainty.</p>
<p>But in this brief tutorial we will not be dealing with the philosophical interpretation of probability. Those topics are more suitable for a sober drink with geek friends.</p>
<p>Think of probability as our way of quantifying chance or uncertainty. It answers: ‘What are the odds that an event \(E\) will occur, given all possible outcomes?’. We denote this as the probability of \(E\) occurring: \(P(E)\).</p>
<p><strong>Probability is one of the pillar concepts in Data Science</strong>. This is me implicitly telling you (or obliging you) to seriously appreciate this as we will use it to build models, train models, evaluate models, and a whole lot more.</p>
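<p>A quick way to build intuition for \(P(E)\) is the long-run frequency view: simulate many trials and count how often \(E\) occurs. A Monte Carlo sketch (the die-roll event is my own example):</p>

```python
import random

# Event E: a fair six-sided die shows an even number. Estimate
# P(E) as (number of times E occurred) / (number of trials).
rng = random.Random(0)
trials = 100_000
hits = sum(1 for _ in range(trials) if rng.randint(1, 6) % 2 == 0)
p_even = hits / trials
print(p_even)   # close to the exact value 3/6 = 0.5
```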
<h2>Dependence, Independence and Conditional Probability</h2>
<p>Consider two events \(A\) and \(B\). When knowing whether \(A\) happens gives us no information about whether \(B\) happens, we say that \(A\) and \(B\) are <em>independent</em>.</p>
<p>Mathematically,</p>
\[P(A,B) = P(A)P(B)\]
<p>On the contrary, if knowing that event \(A\) occurred adds information about whether \(B\) occurs, we say that \(A\) and \(B\) are <em>dependent</em>.</p>
<p>Mathematically,</p>
\[P(A | B) = \frac {P(A,B)}{P(B)}\]
<p>This is the definition of <strong>conditional probability</strong>. We read this equation as the probability that event \(A\) occurs given the knowledge that event \(B\) has already occurred. Or simply: the probability of \(A\) given \(B\).</p>
<p>The conditional probability equation is the generalization of independent and dependent cases. This means that you can apply the equation to both cases.</p>
<p>Take note that this equation is only valid when \(P(B) > 0\) (because math doesn’t permit division by 0).</p>
<p>The most common example to demonstrate this is the boy-girl problem. Consider a family with two children whose gender (birth) is unknown, the probabilities are as follows:</p>
<ul>
<li>no girl: \(\frac {1}{4}\)</li>
<li>1 girl, 1 boy: \(\frac {1}{2}\)</li>
<li>2 girls: \(\frac {1}{4}\)</li>
</ul>
<p>Note that the sum of the probabilities of all events should be 1.</p>
<p>We assume that the siblings are not twins and the likelihood of a child being a girl and a boy is equal. We also assume that the gender of the second child is independent from the gender of the first child.</p>
<p>Consider two problems:</p>
<p>Problem 1: Find the probability of <em>both children being girls</em> (\(\alpha\)) given that the <em>older child is a girl</em> (\(\beta\)).</p>
<p>Problem 2: Find the probability of <em>both children being girls</em> (\(\alpha\)) given that <em>at least one children is a girl</em> (\(\gamma\)).</p>
<p>We start with all the possible outcomes:</p>
<ul>
<li>BB</li>
<li>BG</li>
<li>GB</li>
<li>GG</li>
</ul>
<p>To approach Problem 1, recall the conditional probability:</p>
\[P(\alpha | \beta) = \frac {P(\alpha,\beta)}{P(\beta)}\]
<p>For this case \(P(\alpha,\beta)\) is just \(P(\alpha)\). Thus,</p>
\[P(\alpha | \beta) = \frac {P(\alpha)}{P(\beta)} = \frac {\frac {1}{4}} {\frac {1}{2}} = \frac {1}{2}\]
<p>For Problem 2, \(P(\alpha,\gamma)\) is also \(P(\alpha)\). Thus</p>
\[P(\alpha | \gamma) = \frac {P(\alpha)}{P(\gamma)} = \frac {\frac {1}{4}} {\frac {3}{4}} = \frac {1}{3}\]
<p>To do this numerically in Python, we can generate one million such families and confirm the probabilities.</p>
<script src="https://gist.github.com/albertyumol/7b27fc2d18ba00bb59d2b84f54dfd431.js"></script>
<p>The result are as follows:</p>
<ul>
<li>Problem 1: 0.4992346188401824 ~ \(\frac {1}{2}\)</li>
<li>Problem 2: 0.3331819838973103 ~ \(\frac {1}{3}\)</li>
</ul>
<h2>Bayes's Theorem</h2>
<p>Recall that conditional probability tells us the chance that event \(\alpha\) occurs given \(\beta\). But what if we want the probability of \(\alpha\) given \(\beta\), knowing only the probability of \(\beta\) given \(\alpha\)?</p>
<p>Let’s do some mathematical derivations :)</p>
<p>Conditional probability states that:</p>
\[P(\alpha | \beta) = \frac {P(\alpha,\beta)}{P(\beta)}\]
<p>getting the inverse:</p>
\[P(\beta | \alpha) = \frac {P(\beta,\alpha)}{P(\alpha)}\]
<p>we get,</p>
\[P(\beta,\alpha) = P(\beta | \alpha)P(\alpha)\]
<p>Since \(P(\alpha,\beta) = P(\beta,\alpha)\) (the commutative property), substituting this into the first equation gives:</p>
\[P(\alpha | \beta) = \frac {P(\beta | \alpha)P(\alpha)}{P(\beta)}\]
<p>Next we split \(P(\beta)\) as:</p>
\[P(\beta) = P(\beta, \alpha) + P(\beta, \alpha ^{\dagger})\]
<p>The \(\dagger\) indicates ‘not occurring’.</p>
<p>Substituting the variables and applying the conditional probability equation we get:</p>
\[P(\alpha | \beta) = \frac {P(\beta | \alpha)P(\alpha)}{P(\beta | \alpha)P(\alpha) + P(\beta | \alpha ^{\dagger})P(\alpha ^{\dagger})}\]
<p>A famous example:</p>
<p><em>The medical doctor vs. The Data Scientist</em></p>
<p>Imagine screening for a rare disease. Suppose that 1 in every \(10,000\) people has the disease. If we can detect it with a test that is correct \(99\%\) of the time, then through Bayes’s theorem we can calculate the probability of a true positive, that is, the chance that a person who tested positive actually has the disease.</p>
<p>Assuming that people take the test at random, let \(d\) be the variable for having the disease and \(p\) for testing positive. If we want to know the probability that you have the disease given that you tested positive,</p>
\[P(d | p) = \frac {P(p | d)P(d)}{P(p | d)P(d) + P(p | d ^{\dagger})P(d ^{\dagger})}\]
<p>We deduce that:</p>
<ul>
<li>\(P(p | d) = 0.99\) (given that the test is correct 99% of the time)</li>
<li>\(P(p | d ^{\dagger}) = 0.01\) (probability that a person tests positive but does not actually have the disease)</li>
<li>\(P(d) = 0.0001\) (chance that a person has the disease, \(\frac {1}{10000}\))</li>
<li>\(P(d ^{\dagger}) = 0.9999\) (chance that a person does not have the disease)</li>
</ul>
<p>Substituting these values into the previous equation:</p>
\[P(d | p) = \frac {(0.99)(0.0001)}{(0.99)(0.0001) + (0.01)(0.9999)} = 0.00980392156862745\]
<p>This tells us that although about \(1\%\) of people test positive, only about \(0.98\%\) of those who test positive actually have the disease.</p>
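As a sanity check, the substitution above is easy to reproduce in Python (a small sketch; the function name is my own):

```python
def posterior(p_pos_given_d, p_d):
    """Bayes's theorem for a binary test: P(disease | positive)."""
    p_not_d = 1 - p_d
    p_pos_given_not_d = 1 - p_pos_given_d  # the test is wrong 1% of the time
    numerator = p_pos_given_d * p_d
    evidence = numerator + p_pos_given_not_d * p_not_d  # P(positive)
    return numerator / evidence

p = posterior(p_pos_given_d=0.99, p_d=0.0001)
print(p)  # ≈ 0.0098, matching the hand computation above (exactly 1/102)
```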
<p>Ref: <a href="http://www.stat.yale.edu/Courses/1997-98/101/condprob.htm">http://www.stat.yale.edu/Courses/1997-98/101/condprob.htm</a></p>
<h2>Random Variables</h2>
<p>In data science, we seldom give random variables the attention they deserve; we assume that they are just there, without giving them proper credit. Looking into random variables gives a more in-depth interpretation of results, anchored in probability theory.</p>
<p><a href="https://en.wikipedia.org/wiki/Random_variable">Random variables</a> are any variables whose values depend on outcomes of a random phenomenon like throwing dice or drawing cards.</p>
<p>For a coin toss, the random variables are:</p>
<ul>
<li>1 with probability \(\frac {1}{2}\) if it is a head</li>
<li>0 with probability \(\frac {1}{2}\) if it is a tail</li>
</ul>
<p>Recall our example on the boy-girl problem.</p>
<p>Let \(X\) be the random variable representing the number of girls. Then:</p>
<ul>
<li>\(X = 0\) has probability \(\frac {1}{4}\)</li>
<li>\(X = 1\) has probability \(\frac {1}{2}\)</li>
<li>\(X = 2\) has probability \(\frac {1}{4}\)</li>
</ul>
<p>Now let \(Y\) be the random variable representing the number of girls given that the older child is a girl. Then:</p>
<ul>
<li>\(Y = 1\) has probability \(\frac {1}{2}\)</li>
<li>\(Y = 2\) has probability \(\frac {1}{2}\)</li>
</ul>
<p>Now let \(Z\) be the random variable representing the number of girls given that at least one of the children is a girl. Then:</p>
<ul>
<li>\(Z = 1\) has probability \(\frac {2}{3}\)</li>
<li>\(Z = 2\) has probability \(\frac {1}{3}\)</li>
</ul>
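One lightweight way to make this bookkeeping explicit (a sketch, not a library) is to represent each random variable by its probability mass function, a dictionary mapping values to probabilities:

```python
# Probability mass functions for the three random variables above
X = {0: 1/4, 1: 1/2, 2: 1/4}  # number of girls
Y = {1: 1/2, 2: 1/2}          # number of girls, given the older child is a girl
Z = {1: 2/3, 2: 1/3}          # number of girls, given at least one girl

def expectation(pmf):
    """E[V] = sum of value * probability over the pmf."""
    return sum(value * p for value, p in pmf.items())

for name, pmf in [("X", X), ("Y", Y), ("Z", Z)]:
    assert abs(sum(pmf.values()) - 1) < 1e-12  # probabilities must sum to 1
    print(name, expectation(pmf))  # E[X] = 1, E[Y] = 1.5, E[Z] ≈ 1.33
```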
<p>It is not strictly necessary, but you will appreciate your results more if you properly account for the random variables in your problem (by labelling and identifying them).</p>
<h2>Continuous Distributions</h2>
<p>The world that we live in is a complex world. There are systems that defy gravity, particles that pass through walls, and atoms that teleport.</p>
<p>That’s why it is all fun and exciting.</p>
<p>We will start to integrate a little bit of calculus with statistics, as dealing with real-world problems requires higher math.</p>
<p>A distribution is, simply put, a function listing all possible values (or intervals of values) together with their respective frequencies in the data set.</p>
<p>You are already familiar with some types of distribution. The easiest is the discrete distribution, meaning values are restricted, or <em>quantized</em>, to certain numbers. For example, in a coin toss the outcomes are restricted to 0 and 1, while a die can only show the integers 1 through 6.</p>
<ul>
<li>Uniform Distribution</li>
<li>Probability Density Function (pdf)</li>
<li>Cumulative Distribution Function (cdf)</li>
</ul>
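To make these concrete, here is a small sketch of the probability density function and cumulative distribution function of a uniform distribution on \([0, 1)\):

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1): constant 1 inside, 0 outside."""
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    """Probability that a uniform random variable is <= x."""
    if x < 0:
        return 0        # no mass below 0
    elif x < 1:
        return x        # mass accumulates linearly inside [0, 1)
    else:
        return 1        # all mass lies below any x >= 1

print(uniform_cdf(0.5))  # 0.5: half the mass lies below 0.5
```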
<h2>The Normal Distribution</h2>
<p>My favorite distribution of all is definitely the normal distribution.</p>
\[f(x | \mu, \sigma) = \frac {1}{\sqrt{2\pi} \sigma} \exp \left( - \frac {(x - \mu)^2}{2\sigma^2} \right)\]
<p>Here is the code:</p>
<script src="https://gist.github.com/albertyumol/1ca016de3af5047f24dab392f3b5390f.js"></script>
<p>Here is how it looks:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/stat/normal.png" alt="Normal Distributions." class="center" /></p>
<p>Note the difference between:</p>
<ul>
<li>Normalization (rescaling values into a fixed range, usually \([0, 1]\))</li>
<li>Standardization (rescaling values to have zero mean and unit variance)</li>
</ul>
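A quick sketch of the two operations on a plain list of numbers (the function names are my own):

```python
def normalize(xs):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: rescale to zero mean and unit standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(normalize(data))   # [0.0, 0.33…, 0.67…, 1.0]
zs = standardize(data)   # zero mean, unit variance
```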
<h2>The Central Limit Theorem</h2>
<p>The central limit theorem, simply put, states that:</p>
<blockquote>
<small>
The distribution of the mean of a large number of independent, identically distributed samples approaches a normal distribution centered at the population mean, regardless of the shape of the underlying distribution.
</small>
</blockquote>
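A tiny simulation (a sketch with arbitrary sample sizes) shows the effect: means of uniform draws, which are themselves not normal at all, cluster around the population mean with a spread that shrinks like \(\sigma / \sqrt{n}\):

```python
import random
import statistics

random.seed(1)  # reproducible runs

def sample_means(n_samples=2_000, sample_size=100):
    """Mean of `sample_size` uniform draws on [0, 1), repeated `n_samples` times."""
    return [sum(random.random() for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)]

means = sample_means()
grand_mean = sum(means) / len(means)
spread = statistics.pstdev(means)
print(grand_mean)  # close to the population mean 0.5
print(spread)      # close to sigma / sqrt(n) = (1 / sqrt(12)) / 10 ≈ 0.029
```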
<p>For further exploration:</p>
<ul>
<li>scipy.stats</li>
<li><a href="http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf">Introduction to Probability</a></li>
</ul>
<h1>Hypothesis and Inference</h1>
<h2>Hypothesis Testing</h2>
<script src="https://gist.github.com/albertyumol/c6397836bfeaf684d7dd15dbec33dd30.js"></script>
<h2>Confidence Interval</h2>
<script src="https://gist.github.com/albertyumol/236c57b6f0bc56af04e4bf004f3b29ff.js"></script>
<h2>Bayesian Inference</h2>
<p>An alternative to hypothesis testing is a Bayesian inference procedure. This is done by interpreting probabilities as statements about the parameters themselves, not just about the test. It gets a bit technical and relies on a family of distributions called Beta (\(\beta\)) distributions.</p>
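As a taste of what is coming, a Beta prior updated with coin-flip data has a closed-form posterior. A minimal sketch of the conjugate-update rule (the counts here are hypothetical):

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    yields a Beta(alpha + heads, beta + tails) posterior."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Start from a uniform prior Beta(1, 1) and observe 7 heads in 10 flips
a, b = beta_update(1, 1, heads=7, tails=3)
print(beta_mean(a, b))  # 8/12 ≈ 0.667, the posterior estimate of P(heads)
```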
<p>(to be continued)</p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<p>Here are my resources:</p>
<p><strong>References</strong></p>
<p>[1] Joel Grus. Data Science from Scratch: First Principles with Python.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolThis is a sample tutorial module on basic statistics for data science applications.Tutorial: Animating heatmap overlay-ed into Philippine map.2019-10-06T00:00:00+00:002019-10-06T00:00:00+00:00https://albertyumol.github.io//ph-map<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>This is my third post in my Tutorial series. This time, we will do something geospatial and close to my heart, mapping the Philippines!</p>
<p>To do this in Python, we will be using coordinate mapping to establish boundaries. I am biased toward matplotlib, so it is all we need for this tutorial: no seaborn, no bokeh, and definitely no geopandas.</p>
<p>First, we need to find the coordinates of our map. I used Google Maps to identify the latitude and longitude of a polygon that would cover the entire Philippines.</p>
<p>I set the origin of the map as (lat = 11.9, lon = 122.5). For polygon mapping, you only need two corner points to define a rectangular area. These coordinates are:</p>
<blockquote>
<small>
Lower left coordinate: llcrnrlon = 117, llcrnrlat = 5
Upper right coordinate: urcrnrlon = 127, urcrnrlat = 19
</small>
</blockquote>
<p>Here is the code to define the basemap in Python.</p>
<script src="https://gist.github.com/albertyumol/ba3bd7c289e7ae041b8bafc2b43533b5.js"></script>
<p>And here is the blank map result:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/blank_map.png" alt="Base map." class="center" /></p>
<p>The next thing that we need is data. Since I am an activist, I am interested in visualizing armed conflict in the Philippines. Luckily, a non-profit provides the data set.</p>
<blockquote>
<small>
The Armed Conflict Location & Event Data Project (ACLED) is a disaggregated data collection, analysis and crisis mapping project. ACLED records the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events across Africa, South Asia, Southeast Asia, the Middle East, Europe, and Latin America. Political violence and protest activity includes events that occur within civil wars and periods of instability, public demonstrations, and regime breakdown. ACLED’s aim is to capture the forms, actors, dates, and locations of political violence and protest as it occurs across states. The ACLED team conducts analysis to describe, explore and test conflict scenarios, and makes both data and analysis open for free use by the public. — <a href="https://www.acleddata.com/about-acled/">ACLED</a>
</small>
</blockquote>
<p>I obtained data from 2016 to 2019 and aggregated the data per month by resampling. Here is my code:</p>
<script src="https://gist.github.com/albertyumol/5531429f7df3052ad99538cc64235e90.js"></script>
<p>The next part is making it animated. This is the code:</p>
<script src="https://gist.github.com/albertyumol/28d2adbfe1ebd84b31659e41c59e289e.js"></script>
<p>Here is the final output of the code.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/social_movement2.gif" alt="Philippine rallies." class="center" /></p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolThis is a tutorial on how to animate a heatmap (a.k.a make a gif) and overlay it to Philippine base map using Python.Tutorial: Animating Time Series in Python2019-10-03T00:00:00+00:002019-10-03T00:00:00+00:00https://albertyumol.github.io//animate<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<style type="text/css">
.gist {
margin-left: auto;
margin-right: auto;
width: 100% !important;
height: 100% !important;
}
.gist-data {
height:100%;
overflow-y: inherit;
width: 100%;
overflow-x: hidden;
}
</style>
<script src="https://gist.github.com/albertyumol/05f4fb0b726ec33971810985d246ceff.js"></script>
<p>Here is the rendered gif image from the code above:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/animation/time1.gif" alt="Animate." class="center" /></p>
<p>Next post will be on animation of heatmap overlay-ed into Philippine map.</p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolThis is a tutorial on how to animate a time series (a.k.a make a gif) in Python.Activism via Machine Learning: Modified Hidden Markov Model to forecast protest activities2019-09-25T00:00:00+00:002019-09-25T00:00:00+00:00https://albertyumol.github.io//Predicting-Rallies<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>Have you heard about Greta Thunberg? That <em>‘…very happy young girl looking forward to a bright and wonderful future’.</em> She is all over the news and my twitter feed recently. Most of those in my digital circle post a lot of her <em>gifs</em> as she voices out her advocacies on the main stage of international climate summits. Many seem to be attracted by her sense of purpose and ‘woke-ness’ at a young age.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/greta1.gif" alt="Be like Greta." class="center" /></p>
<p>There seems to be a lot of online clamor showing support for her activism. We tend to be drawn to hero figures like her and the extra-ordinariness of her cause. But sometimes, actions need not to be very extra-ordinary to be heroic.</p>
<p>Let me ask you if you yourself have met an activist like her? It seems unlikely when you live in the Philippines.</p>
<p>I myself am an activist like Greta. I may not be as young as her but I am just as pretty :). To be honest, I think being an activist in the Philippines (regardless of advocacy, e.g. environment, basic social services, human rights, etc.) is much harder and more dangerous. If you declare yourself an activist and go out into the open, more often than not you will be tagged as a member of the New People’s Army, a communist, or even a radical terrorist. Being an activist in the Philippines means accepting the perils and circumstances that come with it.</p>
<p>If you are active on Facebook, chances are high that you have encountered news about atrocities against activists (mainstream media don’t usually report it). Just recently, news broke about forest rangers and protectors in Palawan being killed by paramilitary men and illegal loggers. I think hard about this. <strong>How can we tell the future generations to love and protect nature when people who do that suffer and are killed?</strong></p>
<p>As for my case, I have accepted these consequences and stay firm with my principles and advocacies to make a difference. When I was a student activist in university, my advocacy was accessibility to education, ultimately pushing for free education for all. I believe that education is a right and not a privilege. I joined various demonstrations and rallies to forward and lobby this. As a science major, I always integrated my math skills to calculate the feasibility figures of free education and put them creatively in our rally boards and chants. The streets became my laboratory and the struggles of poor students became my thesis.</p>
<p>Eventually, through the pressure of decade-long big rallies (see image below) and actual lobbying in congress (yes), in August 2017 the Universal Access to Quality Tertiary Education Act, or Republic Act 10931, was signed into law. Now, students from various walks of life can enjoy free education in all state colleges and universities in the country.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/test4.gif" alt="Education rallies." class="center" /></p>
<p>Education is only one part of the various struggles that Filipino citizens face on a daily basis spanning from the lack of basic social services, government unaccountability and state neglect, irresponsible mining, and the notorious extrajudicial killings in the face of <em>tokhang</em>. These struggles happen across the archipelago and experienced by various sectors of our society.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/social_movement2.gif" alt="Philippine rallies." class="center" /></p>
<p>In my years as a student activist, I learned that to build democracy, we need to put it in our collective hands. There are lessons in history that we should never forget, like how we did it in People Power 1 and People Power 2 against the dictator Marcos and jueteng-lord Estrada. We can also learn lessons from the recent collective actions of our neighbors in Hong Kong against extradition and from student rallies in Indonesia against proposed new laws criminalizing extramarital sex and insults to their president’s honor.</p>
<p>Nowadays, much of our time is spent online; thus emerges a new type of activist, often called keyboard activists. They are the ones who initiate twitter rallies and share ‘woke memes’. Maybe you yourself have shared or retweeted their content.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/twitter.jpg" alt="Twitter rallies." class="center" /></p>
<p>For this particular project, what I am interested is <strong>when do these activists flood the streets? When will their digital words turn into real-life actions?</strong></p>
<p><strong>Data Wrangling</strong></p>
<p>To do that I need a lot of data. Luckily, the GDELT project provides more than enough.</p>
<blockquote>
<small>Global Database of Events, Language, and Tone (GDELT), created by Kalev Leetaru of Yahoo! and Georgetown University, along with Philip Schrodt and others, describes itself as "an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day." [2].</small>
</blockquote>
<p>The GDELT Project is a real time network diagram and database of global human society for open research [3]. It monitors print, broadcast, and web news media in over 100 languages from across every country in the world to keep continually updated on breaking developments anywhere on the planet. Its historical archives stretch back to January 1, 1979 and update every 15 minutes. Through its ability to leverage the world’s collective news media, GDELT moves beyond the focus of the Western media towards a far more global perspective on what’s happening and how the world is feeling about it.</p>
<p>I used Google’s Big Query to get the event logs of all news in the Philippines since year 2000 (here is the script I used):</p>
<script src="https://gist.github.com/albertyumol/3715a1cb2c5efb96269b05ac4dce0d02.js"></script>
<p>These are the steps that I followed:</p>
<blockquote>
<small>
1. Ground Set Extraction <br />
2. Burstiness Modelling <br />
3. Hidden Markov Modelling <br />
4. Naive Bayes Decision
</small>
</blockquote>
<p>Each record in GDELT has 61 fields, pertaining to a specific event in CAMEO format.</p>
<blockquote>
<small>
Conflict and Mediation Event Observations (CAMEO) is a framework for coding event data (typically used for events that merit news coverage, and generally applied to the study of political news and violence). [4]
</small>
</blockquote>
<p>For my daily aggregation, I only need and obtained these fields:</p>
<blockquote>
<small>
SQLDATE, MonthYear, EventRootCode, GoldsteinScale, NumMentions, AvgTone, ActionGeo_CountryCode, ActionGeo_Lat, ActionGeo_Long
</small>
</blockquote>
<p>Note:</p>
<blockquote>
<small>
<b>GoldsteinScale</b> is a numerical score ranging from -10 to 10 which signifies the theoretical potential impact that type of event will have on the stability of the country. <b>NumMentions</b> is the total number of mentions of this event across all source documents, which can be used as a method of assessing the importance of an event: the more the discussion of the event is, the more likely it is to be significant. <b>AvgTone</b> is the average tone of all documents containing one or more mentions of this event ranging from -100 (extremely negative) to 100 (extremely positive) [5]. <b>Action_Geo_Country</b> code is the location of the event, <b>ActionGeo_Lat</b> and <b>ActionGeo_Long</b> are the centroid latitude and longitude of the landmark which I used to plot the Philippine map gif above.
</small>
</blockquote>
<p><strong>Ground Set Extraction</strong></p>
<p>The events in GDELT are categorized by identified themes labeled 1 to 20. The theme includes verbs describing the type of action of an event like reject (12), protest (14), threaten (13), coerce (17), assault (18), etc.</p>
<p>To get the ground truth, event root code number 14 signifies events with mentions of PROTEST. I kept only those with a significant number of mentions across the years. I started with 2000 and aggregated on a daily basis. Plotting the time series,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/unnormalized.png" alt="Dirty timeseries." class="center" /></p>
<p>we see that there is a heterogeneous upward trend in the event mentions [5]. To remove this, I implemented a 90-day moving average to normalize the signal using this equation:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/eq3.png" alt="Equation 1." class="center" /></p>
<p>To set a baseline for the number of significant events, we define the average mention count on each day as</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/eq4.png" alt="Equation 2." class="center" /></p>
<p>and to smooth out the data, we use a <strong>seven day</strong> moving average:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/eq5.png" alt="Equation 3." class="center" /></p>
<p>where \(\theta\) is the upper bound of the 95% confidence interval of the time series. It’s a lot of math, I know (same girl, I can relate, it also took me a while to understand this), but we aren’t yet at the model, which is more intense (and fun!).</p>
<p>Upon normalization, here is the result:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/normalized.png" alt="Equation 4." class="center" /></p>
<p>All of the points above the red line are days with significant rallies.</p>
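The exact normalization equations appear as images above, but the general idea can be sketched in plain Python (my own simplified version, assuming a trailing moving-average window):

```python
def moving_average(xs, window):
    """Trailing moving average; early points use whatever history exists."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out

def normalize_counts(counts, window=90):
    """Divide each day's mention count by its trailing moving average
    to strip out the long-term upward trend in news volume."""
    trend = moving_average(counts, window)
    return [c / t if t else 0.0 for c, t in zip(counts, trend)]

counts = [10, 12, 9, 30, 11, 13]  # toy daily mention counts; day 4 spikes
print(normalize_counts(counts, window=3))
```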
<p>I used these points as labels and reduced my problem to a supervised binary classification. Days with a significant count of rally mentions are labeled 1, while those without are tagged 0.</p>
<p>I will discuss the details of the coupled Burstiness and Hidden Markov Model (HMM) that I implemented in another blog post. I chose HMM because it accounts for time series variation as sequence learning and coupled it with Burstiness Modelling to properly account for the probabilities and duration of events (also called as states).</p>
<p>Research on social movements points out that there are mini-events that lead up to the occurrence of a big rally. In this study, I hypothesize that this event progression is given by this ladder:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/ladder.png" alt="Ladder of Social Movement" class="center" /></p>
<p>These five so-called states (identified in GDELT as event root codes 10, 11, 12, 13, and 14) are what I used to define the observation vector needed by the HMM. To increase prediction accuracy, I also added the AvgTone and GoldsteinScale (discussed above) to the dimensions of the observation. Thus, the final vector is given by:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/eq2.png" alt="Equation 4." class="center" /></p>
<p>HMM is a strong candidate for this problem because it can approximate the likelihood of an event occurring given the probabilities of the progression of the states. This statement can be visualized by this diagram:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/states.gif" alt="Probability States" class="center" /></p>
<p>The numbers in the diagram indicates probabilities when a particular state will transition to another state. HMM is used to estimate the parameters for the model. To increase the accuracy of the solution, I trained two models. One is trained to classify if a date range contains a rally (SM-prone) and the other model is trained to identity non-rally days (SM-free) as exemplified by this diagram:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/model.png" alt="Modelling diagram" class="center" /></p>
<p>I used <strong>2000-2015 as my training set</strong> and <strong>2016-2019 as my test set</strong>. I implemented a Bayes log likelihood decision mechanism to decide which model is more accurate in a given date prediction range.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/predict.png" alt="Prediction scheme" class="center" /></p>
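The core of this decision step is comparing the likelihoods the two trained HMMs assign to a window of observations. For a discrete HMM that likelihood comes from the forward algorithm; here is a toy sketch (all model parameters below are hypothetical, not the fitted ones):

```python
import math

def forward_likelihood(pi, A, B, obs):
    """P(observations | model) via the forward algorithm.
    pi: initial state probabilities; A[i][j]: transition i -> j;
    B[i][o]: probability of emitting symbol o from state i."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return sum(alpha)

def classify(obs, model_prone, model_free):
    """Pick whichever model assigns the higher log-likelihood."""
    ll_prone = math.log(forward_likelihood(*model_prone, obs))
    ll_free = math.log(forward_likelihood(*model_free, obs))
    return "SM-prone" if ll_prone > ll_free else "SM-free"

# Two hypothetical 2-state models over a binary "unrest signal" (0 = quiet, 1 = loud)
prone = ([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [[0.2, 0.8], [0.6, 0.4]])
free = ([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [[0.9, 0.1], [0.8, 0.2]])
print(classify([1, 1, 0, 1, 1], prone, free))  # "SM-prone"
```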
<p>For baseline model comparison, since I reduced the problem into a supervised binary classification, I used <em>Logistic Regression</em>. It is commonly used in a lot of machine learning methods in the event prediction and forecasting literature. For each day, I summed over all event mentions with event root code 14 and use it as an indicator of big rallies.</p>
<p><strong>Results</strong></p>
<p>The ROC curve of the model compared to the <em>Logistic Regression</em> baseline is shown below:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/resulta.png" alt="ROC curves" class="center" /></p>
<p>Here are the accuracies and precisions:</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/accuracy.png" alt="Performance metrics" class="center" /></p>
<p>As we can see, the modified Hidden Markov Model performed better than Logistic Regression. Using a time window of seven days means that this particular model implementation will be able to predict if a big rally will happen within the next 7 days.</p>
<p><strong>Conclusion</strong></p>
<p>The results can be useful depending on your biases. If you are an activist like me who organizes people to rally for certain causes and advocacies, you would know whether there is enough online clamor before a rally occurs and could encourage more discussion to meet a certain threshold value.</p>
<p>If you are a member of the reactionary state force, you will most likely use this prediction to suppress social movements in favor of the status quo.</p>
<p>If you are a normal citizen, you decide whether the issues being talked about speak relevance and weigh in future consequences.</p>
<p><strong>Recommendations</strong></p>
<p>Current research points out that social embeddedness, emotions, grievance, and identity [6] are the most important features that influence protest activities. This can be verified using NLP and feature importance on the news data set used above, and can be done as an extension of this project.</p>
<p><strong>Takeaways</strong></p>
<p>My bias is that I am an activist. I believe that activism is a way of life. And every one of us is an activist in our own little way. Sometimes we just need to be reminded why we do the things we do and ultimately ask <em>for whom</em> we do these things.</p>
<p>For my case, I will continue to join rallies and stick to my principles as I believe that only through collective action can we truly achieve genuine social change. As for the corrupt politicians in the Philippines,</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/rally/greta2.gif" alt="Greta is angry." class="center" /></p>
<p>See you in the future and hopefully in future rallies and protest actions.
#BeLikeGreta
#DataScienceForThePeople</p>
<p><strong>Credits</strong></p>
<p>I want to acknowledge <a href="https://www.eskwelabs.com/">Eskwelabs</a> in pursuit of this project. Also, shout out to Data Science Fellow Cohort II. You are the best data science study buddies! I know you guys will be actuators in your future endeavors. Sending lots of love and virtual hugs :)</p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<p><strong>References</strong></p>
<p>[1] Gfycat. Donald Trump and Greta Thunberg 1. Retrieved from:
<a href="https://gfycat.com/fancycoarsekrill-donald-trump">https://gfycat.com/fancycoarsekrill-donald-trump</a></p>
<p>[2] Wikipedia. Global Database of Events, Language, and Tone. Retrieved from:
<a href="https://en.wikipedia.org/wiki/Global_Database_of_Events,_Language,_and_Tone">https://en.wikipedia.org/wiki/Global_Database_of_Events,_Language,_and_Tone</a></p>
<p>[3] The GDELT Project. Retrieved from: <a href="https://www.gdeltproject.org/">https://www.gdeltproject.org/</a></p>
<p>[4] Wikipedia. Conflict and Mediation Event Observations. Retrieved from: <a href="https://en.wikipedia.org/wiki/Conflict_and_Mediation_Event_Observations">https://en.wikipedia.org/wiki/Conflict_and_Mediation_Event_Observations</a></p>
<p>[5] Discrete Dynamics in Nature and Society. Predicting Social Unrest Events with Hidden Markov Models Using GDELT. Retrieved from: <a href="https://www.hindawi.com/journals/ddns/2017/8180272/">https://www.hindawi.com/journals/ddns/2017/8180272/</a></p>
<p>[6] EPJ Data Science. Activism via attention: interpretable spatiotemporal learning to forecast protest activities. Retrieved from: <a href="https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-019-0183-y">https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-019-0183-y</a></p>
<p>[7] Gfycat. Donald Trump and Greta Thunberg 2. Retrieved from:
<a href="https://tenor.com/view/well-be-watching-you-greta-thunberg-gif-15167876">https://tenor.com/view/well-be-watching-you-greta-thunberg-gif-15167876</a></p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolSocial movements exhibit a complex system of social human behavior. These events demonstrate the capacity of people and their collective action to influence political decisions and public policies. This study delves into developing a model to predict future events of big rallies and protest in the Philippines by correlating it to online dissent and mentions in news outlets and social media using a coupled Burstiness and Hidden Markov Model.‘open’ laptop anyone?2019-05-18T00:00:00+00:002019-05-18T00:00:00+00:00https://albertyumol.github.io//openhardware<div id="fb-root"></div>
<script async="" defer="" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
<p>Welcome to my first entry on open electronics! It has been a while since my last blog entry. You know, as an activist, the political arena in the Philippines takes a toll on people. Who wouldn't be stressed by Duterte's drug war, threats of Martial Law, the jeepney phaseout, the unsolicited Build Build Build program, and its Tax Reform for Acceleration and Inclusion (TRAIN) Law? Let's take a short break from all of this commotion and discuss how I managed to switch to a new laptop on a budget.</p>
<p>I have been a MacBook Pro user for about five years now. One of my weird practices is to christen my computers. I named my MacBook Merlin, after the fictional Welsh figure from the legend of King Arthur, because I really like magick personalities. Merlin served me well since he's a Mac and I use computational power intensively, like running Python code for weeks and doing video editing on the side.</p>
<p>But even before that, I had software and hardware compatibility issues with the Mac. Upgrading Java or even running the Arduino IDE is not straightforward. I often used virtual machines to run my favorite Linux distros, which of course is itself very memory intensive. On top of these are issues of security and weight. Recently I had ads popping up even with my browser closed. The Mac is also very heavy and heats up a lot! All in all, my Mac experience was good in that it could sustain my daily dose of computational power, but it felt very restricting in terms of how much I could modify the system.
The time came when I finally gave up on the Mac and wiped it clean with a fresh Ubuntu install. But of course Linux on a Mac is not necessarily a good idea, because the MacBook is made for macOS.</p>
<p>So I decided to move on and charge it to experience. It was time to find a new laptop. But I had one major problem: I'm broke. As an unsalaried NGO volunteer, I have a very limited budget from my savings. At max, I can only spend around 10,000 pesos - all in.</p>
<p>Thrift-tech enthusiast that I am, I happen to be familiar with the nooks and crannies of the cheapest finds for most things that run on electricity. In my spare time, I usually visit these thrift-tech stores just to look around and familiarize myself with the latest gadgets (not necessarily phones and laptops).</p>
<p>My top choice is the well-known tech-thrift place in Manila - Gilmore I.T. Center.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/gilmore.png" alt="Gilmore" class="center" /></p>
<p>This place is a haven for people like me. Everything is cheap, and most prices can be haggled. Before I went to Gilmore, I did some initial research and created a checklist for the cheap laptop. As a self-proclaimed hipster, form factor is important to me: I want it to be light and unique. My research led me to a specific brand that suited my criteria - the IBM ThinkPad.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/lenovo.png" alt="Lenovo X200" class="center" /></p>
<p>Just look at that monster! How hipster can you get with that legendary mechanical keyboard and the red TrackPoint for a mouse? At first sight, I was flabbergasted by the design of the IBM Lenovo ThinkPad. It came to a point where this monster visited me in my sleep.</p>
<p>The next morning, I found myself gallivanting around Gilmore IT Center looking for my next best friend. I had found many blogs and vlogs on particular models of the Lenovo ThinkPad, but one rose above them all - the mighty X200.</p>
<p>I was already on the 4th floor of the building, keenly looking for my catch. I found a lot of good stalls with really crazy offers, such as PC Green.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/pc_green.jpg" alt="PC Green" class="center" /></p>
<p>They sell laptop units for as low as 2,500 pesos for brands including NEC, Fujitsu, and Samsung - mostly Japanese surplus. Sadly, they didn't have the X200. Just as I was about to lose hope that day, I turned to the last shop, in front of PC Green - Rhem's Computer Trading.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/rem.png" alt="REM Shop" class="center" /></p>
<p>A saleslady approached me and asked what I was looking for. I said a ThinkPad X200, and she replied knowingly, pitching hardware specs and availability. Luckily, they had one more unit of the model in stock.</p>
<p>And lo and behold, I was on my way home, ready to tweak my loot. The unit I bought was almost as good as new (99% smooth), with 2 GB of DDR3 RAM, an Intel Core 2 Duo, and a 160 GB HDD for a measly 5,000 pesos.</p>
<p>Before I went home, I also bought a 120 GB SSD to ramp up the computer's speed. The first thing I did upon returning home was replace the drive and add another 2 GB of DDR3 RAM. Here are photos of the WD SSD I bought for 2,900 pesos and my spare RAM from my previous laptop.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/internal.png" alt="Internal parts" class="center" /></p>
<p>The laptop came with Windows Vista installed, but as a Linux boy since high school, Windows was never and will never be good enough for me. My original plan was to install an open-source BIOS called Libreboot on the laptop, but I have so far had no luck finding a SOIC test clip to let me hack the hardware. This is what the SOIC clip looks like, so if you happen to have one, please send help.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/soic.png" alt="SOIC Jumper" class="center" /></p>
<p>There is a workaround, however, but it requires soldering wires to the motherboard and connecting them to the GPIO pins of a Raspberry Pi. I'll probably do that as a last resort. Why am I so concerned with replacing the BIOS? Because Intel processors are rumored to have backdoors. For foolproof security, this needs to be hacked. For now, I'll settle for a plain Linux installation. Here is what the laptop looks like.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/finished.png" alt="Actual Pretty Laptop" class="center" /></p>
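<p>For the curious, the Raspberry Pi workaround is usually done with <code>flashrom</code> and its <code>linux_spi</code> programmer. The sketch below is hypothetical: the device path, SPI speed, and image file name are illustrative, and you should always take two identical reads of the original chip before writing anything.</p>

```shell
# Hypothetical sketch: externally flashing the X200's BIOS chip from a
# Raspberry Pi. Assumes SPI is enabled (raspi-config) and the clip is
# wired correctly to the chip.

# Read the existing firmware twice and compare -- identical dumps mean
# the clip/wiring connection is stable.
flashrom -p linux_spi:dev=/dev/spidev0.0,spi_speed=512 -r backup1.rom
flashrom -p linux_spi:dev=/dev/spidev0.0,spi_speed=512 -r backup2.rom
cmp backup1.rom backup2.rom && echo "reads are consistent"

# Only then write the Libreboot image (file name is illustrative).
flashrom -p linux_spi:dev=/dev/spidev0.0,spi_speed=512 -w libreboot_x200.rom
```

<p>Keeping those backup dumps safe matters: they are the only way to restore the stock BIOS if the flash goes wrong.</p>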
<p>I installed my favorite distro, the gorgeous elementary OS. My only concern now is the battery life, which is not great: it clocks in at around 90 minutes. When I have the funds, I will buy a 9-cell battery for a full day of volt juice.</p>
<p>I also performed my usual ceremonies upon installation of the OS. I installed my security and networking basics, such as the Tor Browser, VeraCrypt, and the nmap package. I also ramped up the desktop experience by installing the Numix theme. Here is a snapshot of my desktop.</p>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/desktop.jpg" alt="Desktop" class="center" /></p>
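<p>If you want to set up the same basics on a fresh install, a quick shell loop can tell you what is still missing. The package names below are the usual Debian/Ubuntu ones; adjust for your distro.</p>

```shell
# Check which command-line basics are already on the system and
# suggest the apt package for anything missing.
for tool in tor nmap; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: installed"
    else
        echo "$tool: missing (try: sudo apt install $tool)"
    fi
done
```

<p><code>command -v</code> is the portable way to test for a binary, so the same loop works on any POSIX shell.</p>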
<p>Gorgeous, isn’t it? Amazingly, the X200 line offered one of the very first fingerprint scanners for logging in. I was able to set it up in a jiffy using the fprint packages with this command.</p>
<blockquote>
<small>sudo apt install libpam-fprintd fprint-demo</small>
</blockquote>
<p style="text-align: center;"><img src="https://albertyumol.github.io//images/open_hardware/fingerprint.jpg" alt="Fingerprint" class="center" /></p>
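<p>After installing the packages, enrolling a finger and hooking it into login takes a couple of commands. This is a sketch for Ubuntu-family systems, where <code>fprintd-enroll</code> and <code>pam-auth-update</code> are the standard tools; other distros may differ.</p>

```shell
# Enroll and verify a fingerprint, then enable it for PAM logins.
fprintd-enroll "$USER"    # swipe the finger repeatedly when prompted
fprintd-verify "$USER"    # confirm the stored print matches
sudo pam-auth-update      # tick "Fingerprint authentication" in the menu
```

<p>Once PAM is updated, the same fingerprint works for the login screen, <code>sudo</code>, and screen unlocking.</p>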
<p>Booting takes at most 20 seconds. The experience feels amazing. Applications that took ages to load on the Mac now run in an instant. Thanks to the SSD and the power of Linux!</p>
<p>In summary, I was able to find a laptop on a budget, 'maxed out' the specs still on a budget, and now run it on free and open-source software. Looking forward, I hope to find a SOIC clip soon so I can Libreboot my laptop like the Unix god Richard Stallman.</p>
<p>But so far, I am a very happy kid with this find. I have asked my geek friends for an oracle that can lead me to a SOIC clip. With that, I'm going to name the laptop Endor, after the sorceress whom King Saul of the Bible sought out to predict his fate in the war against the Philistines.</p>
<p>What do you think? Leave a comment below, and don’t forget to say ‘hi’.</p>
<p>Want to collaborate? Message me on <a href="https://ph.linkedin.com/in/albertyumol">LinkedIn</a>.</p>
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-6410209740119334",
enable_page_level_ads: true
});
</script>
<div class="fb-comments" data-href="https://albertyumol.github.io/" data-numposts="5"></div>Albert YumolIt's time to find a new laptop. But I have one major problem. I'm broke. As an un-salaried NGO volunteer, I have very limited budget from my savings. At max, I can only spend around 10,000 pesos - all in. As a thrift-techie enthusiast that I am, I happen to be familiar with the nooks and crannies of the cheapest finds for most things that run with electricity. During my spare time, I just usually visit these thrift-tech stores just looking around, familiarizing myself with latest gadgets (not necessarily phones and laptops).