<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=PT+Serif&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css2?family=DM+Sans&display=swap" rel="stylesheet">
    <link rel="stylesheet" href="style.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
      
      
<title>DataSprint - Online Harassment A</title>
    
</head>
<body>
    <div id="container">
        <div id="first-section">
            <div id="title"> DataSprint</div>
            <div id="subtitle">Mapping AI, <br>
                Online Harassment <br>
                and Online Polarization
                </div>
            <div id="logo-container">
                <a href="https://densitydesign.org/"><img class="logo" src="assets/densitylogo.png" alt="DensityDesign Lab logo"></a>
                <a href="https://www.polimi.it/"><img class="logo" src="assets/polilogo.png" alt="Politecnico di Milano logo"></a>
            </div>
        </div>
        <div id="second-section" onmouseenter="enableScroll()" onmouseleave="disableScroll()">
            <div id="right-title">Mapping Online Harassment</div>
            <div id="text-content"> 
                We investigated the intersection of Online Harassment and AI, addressing specific tools, prevalence, and accessibility. Our study also analyzed the language used in online harassment discussions, identifying prevalent words across platforms. We explored platforms with a higher concentration of such words and delved into specific communities fostering their extensive use. This comprehensive approach illuminates the multifaceted dimensions of Online Harassment and its connection to artificial intelligence and synthetic media.
                
        
                <div class="img_producto_container" data-scale="1.9">
                    <a
                      class="dslc-lightbox-image img_producto"
                      href="assets/summary.svg"
                      target="_self"
                      style="background-image:url('assets/summary.svg')"
                    >
                    </a>
                  </div>
                  

                <div id="right-small-title">Data Scraping</div>
                <div id="text-small-content">
                    The tools we used to scrape content online were mainly from <a href="https://console.apify.com/">Apify</a> and <a href="https://webscraper.io/">Webscraper.io</a>.
                    Scraping data with a generic “Online harassment” query, we obtained: <br>

                    <ul>
                        <li>10,000 elements from Reddit</li>
                        <li>650 elements from X (Twitter)</li>
                        <li>270 elements from TikTok</li>
                        <li>100 elements from Threads</li>
                        <li>2,500 elements regarding AI tools from <a href="https://www.kaggle.com/datasets/muhammadtalhaawan/ai-5000-tools-2023?resource=download">Kaggle</a></li>
                      </ul>  

                </div>

                <div id="right-small-title">Cleaning</div>
                <div id="text-small-content">
                    Reading through each corpus, we deleted all content that was not user-generated.
                    With the <a href="https://colab.research.google.com/drive/1cDtP-hYE9FgK45iYraREqd-ZEop-dZld?usp=sharing#scrollTo=7bF9nhaaMAQ3">Google Colab NLP Toolkit</a>, we removed stopwords. After converting the file to .docx, we uploaded it to <a href="https://www.laurenceanthony.net/software/antconc/">AntConc</a>.
                    The software finds the words before and after each term, counting them and forming groups (trinomes). We discarded these groups in our process, focusing instead on single words and their density.
                    Back in Colab, we lemmatized the text to normalize the wording, reducing inflected forms to a common base form.
                </div>

                Take a look at our  <a href="https://docs.google.com/spreadsheets/d/1567v-gIH2GBNrznAQlGcYoH8bljscqZjdkUx5ve5GGY/edit#gid=999940645">dataset →</a>

            </div>
            
            <!--AI TOOLS SECTION-->
            <div id="right-title">AI Tools Exploration</div>
            <div id="text-content">
                We began by mapping the AI tools available online and categorizing them by type: text, image, 3D, video, audio, code, business, and other. We then gauged how accessible these tools are by mapping their pricing structures.
            </div>

            <div class="flourish-embed flourish-survey" data-src="visualisation/16588785"><script src="https://public.flourish.studio/resources/embed.js"></script></div> 

            <div class="image-text-section">
                <div id="text-small-content">
                <br>
                    <p id="keyfindings-small-title">Key findings</p>
                    <ul>
                        <li>The most prevalent category of tools is "business," followed by "text," with "video" and "3D" being less common.<br><br></li>
                        <li>Most of the tools are easily accessible: the majority are either free or fall within the lower price range ($0.1-10).<br><br></li>
                        <li>The distribution across price ranges does not vary significantly between categories. Where there are more tools, the numbers of both free and expensive tools increase proportionally, keeping the distribution relatively balanced.</li>
                        </ul>  
        </div>           
    </div>


            <!--HARASSMENT 1 SECTION-->
            <div id="right-title">Online Harassment Across Platforms: <br> Word Analysis</div>
            <div id="text-content">
                Next, we explored discussions of "online harassment" across Threads, TikTok, and Twitter. We systematically collected and analyzed all comments related to the topic, extracting the words people used to express their views. We then plotted these words in a graphic in which position reflects prevalence on each platform. Words at the center of the graphic are common to all three platforms, giving insight into the shared discourse around online harassment.
            </div>

            <iframe style="margin-left:3vw" width="80%" height="70%" src="https://ouestware.gitlab.io/retina/1.0.0-beta.1/#/embed/?url=https%3A%2F%2Fgist.githubusercontent.com%2Felisabettacomo%2F0f5b4e3a18071d14e5fe61c8a2b0eeef%2Fraw%2Fbe6aeeecff211164eddd96cb573fa8391bbfa7ca%2FCOLOR-2.graphml&sa=r&ca[]=wi-s&ca[]=wo-s&ca[]=wd-s" frameBorder="0" title="Retina" allowFullScreen></iframe>
            <div id="keyfindings-small-title">Key findings</div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding6.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">Threads use of terms on AI-linked harassment</p>
                    Discussions on Threads use notably specific and technical language about online harassment linked to artificial intelligence.
                     </div>
            </div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding5.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">Scarcity of shared terms</p>
                    Few terms are shared among the three platforms. Notably, the center of the graphic shows a binary relationship between "safety" and "threat."
                    </div>
            </div>

            
            <!--REDDIT 1 SECTION-->
            <div id="right-title">Insights from 32 subcommunities</div>
            <div id="text-content">
                All data extracted from Reddit originates from 32 different subreddits, which we have mapped in a visualization. The size of each bubble in the visualization is proportional to the number of posts and comments for each subreddit. This mapping and overview of the various subreddits are essential to understand the starting point of the dialogue, thus revealing the intentions and perspectives of the community, which may lead to various polarizations. The topics of the subreddits range from news and politics in general to information about entertainment and technology.
             </div>
            <div style="margin-left: 2vw;" class="flourish-embed flourish-bubble-chart" data-src="visualisation/16580759"><script src="https://public.flourish.studio/resources/embed.js"></script></div>
            
            <div class="image-text-section">
                <div id="text-small-content">
                <br>
                    <p id="keyfindings-small-title">Key findings</p>
                    <ul>
                        <li>The more specific and polarized communities, as evident from their descriptions, include KotakuInAction, the Reddit hub for GamerGate, a controversial 2014 movement on gaming ethics and journalism, marked by harassment, misogyny, and discussions on industry transparency and inclusivity. On the other hand, there is TwoXChromosomes, intended as a space for sharing content from a feminine perspective to promote respect. The subreddit whose topics come closest to the research query is Instagramreality, which aims to showcase edited photos and reveal the truth behind them.</li>
                    </ul>  
        </div>           
    </div>

            <!--REDDIT 2 SECTION-->
            <div id="right-title">Reddit's Take on Online Harassment</div>
            <div id="text-content">
                Next, we focused on Reddit, surveying the language used in discussions about online harassment. In particular, we examined the distinct perspectives of each subreddit community addressing the topic. These communities show diverse, often opposing viewpoints, each with its own tone of voice and perception of the subject.
            </div>
            <iframe style="margin-left:3vw" width="80%" height="60%" src="https://ouestware.gitlab.io/retina/1.0.0-beta.1/#/embed/?url=https%3A%2F%2Fgist.githubusercontent.com%2Felisabettacomo%2F7952f6771a5d1f8e1fd8cb1667d9bf45%2Fraw%2Fb602e21f56e56631511fe0558eac387c7e236869%2Freddit-comm.graphml&sa[]=ge&sa[]=gp&sa[]=r&ca[]=m-s&ca[]=gu-s" frameBorder="0" title="Retina" allowFullScreen></iframe>
            <div id="keyfindings-small-title">Key findings</div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding1.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">Subreddit-specific lexicons</p>
                    It is interesting to observe the scarcity of common terms among the three platforms, but it's curious to notice in the center the binary relationship between "safety" and "threat."
                 </div>
            </div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding2.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">Shared terms and controversies across subreddits</p>
                    Terms shared across the different communities center on online harassment and the use of assertive language, including "rape," "sex," "porn," and "hate." The language often includes elements that could be characterized as hate speech or potentially offensive content, woven into the themes of each community's topics.
                 </div>
            </div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding3.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">Gendered threads</p>
                    Examining the linguistic landscape of two contrasting subreddits, r/unitedkingdom and r/TwoXChromosomes, reveals a thematic cluster of terms predominantly related to gender and identity.
                </div>
            </div>

            <div class="image-text-section">
                <div class="image-content">
                    <img src="assets/keyfinding4.png" alt="Image Description">
                </div>
                <div id="text-small-content-keyfindings">
                    <p id="keyfindings-small-title">'Woman' as a marker for gender harassment </p>
                    The term most consistently shared across the platforms is "woman," indicating a likely association with the category of "gender harassment."
                 </div>
            </div>


            <!--REDDIT 3 SECTION-->
            <div id="right-title">Evolution of keywords</div>
            <div id="text-content">
                This bar chart race illustrates the evolution over time of key terms within a dataset of 10,000 Reddit posts and comments. The timeframe spans 2020 to 2023, offering insight into lexical trends over this short period. The words were selected at our discretion from the top 30 most frequently used terms on Reddit, with the aim of ensuring diversity, interest, and significance in capturing a trend.
            </div>
            <br><br><br><br>
            
            <div class="flourish-embed flourish-bar-chart-race" data-src="visualisation/16587439"><script src="https://public.flourish.studio/resources/embed.js"></script></div>
        <br><br>
            <div class="image-text-section">
                <div id="text-small-content">
                    <p id="keyfindings-small-title">Key findings</p>
                    <ul>
                        <li>First, a positive correlation emerged between the terms "Ai" and "Video" over the years, suggesting a rise in content involving artificial intelligence in online videos on the platform.<br><br></li>
                        <li>Another notable point is the association between mentions of "Hate" and "racist," with a peak in 2021 followed by a decline in 2022. This may indicate increasing awareness of and action against racism during that period, reflecting an active response to the issue.<br><br></li>
                        <li>The analysis highlights consistent growth in interest in artificial intelligence, indicated by the steady increase in mentions of "Ai" from 2020 to 2023. This suggests growing concern or curiosity within the Reddit community about AI-related topics.<br><br></li>
                        <li>The consistent mention of "Video" indicates an ongoing community focus on multimedia content, with a significant portion of discussions and content on Reddit revolving around video material.<br><br></li>
                        <li>Lastly, the term "Help" increased continuously over the period, suggesting a rise in help requests or discussions of support-related topics on the platform.</li>
                      </ul>  
        </div>           
    </div>
        </div>
    </div>

    <script>
        function toggleFullscreen() {
            var imageContent = document.querySelector('.image-content');
            imageContent.classList.toggle('fullscreen');
        }
    </script>

    <script>
        function enableScroll() {
            document.getElementById('second-section').style.overflow = 'auto';
        }

        function disableScroll() {
            document.getElementById('second-section').style.overflow = 'hidden';
        }
    </script>

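<script>
// Hover zoom for the summary image: scale the image while the pointer is
// over its container, and keep the zoom anchored at the pointer position.
$(".img_producto_container")
  .on("mouseover", function() {
    // Zoom in by the factor given in the container's data-scale attribute
    $(this)
      .children(".img_producto")
      .css({ transform: "scale(" + $(this).attr("data-scale") + ")" });
  })
  .on("mouseout", function() {
    // Reset the zoom when the pointer leaves the container
    $(this)
      .children(".img_producto")
      .css({ transform: "scale(1)" });
  })
  .on("mousemove", function(e) {
    // Set the transform origin to the pointer position, expressed as a
    // percentage of the container's width and height
    $(this)
      .children(".img_producto")
      .css({
        "transform-origin":
          ((e.pageX - $(this).offset().left) / $(this).width()) * 100 +
          "% " +
          ((e.pageY - $(this).offset().top) / $(this).height()) * 100 +
          "%"
      });
  });
</script>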


</body>
</html>