CrowdTangle: Visualizing Communalytic’s Two-mode Semantic Network in Gephi with Facebook and Instagram Data (Part 3/3)

SNA programs like Gephi provide additional ways to visualize and analyze networks exported from Communalytic. This quickstart guide walks you through some basic data preparation and visualization steps in Gephi. For a more in-depth look at Gephi, please refer to Gephi’s online learning resources.

This guide will show you how to:

  • choose a layout for your Communalytic Two-mode Semantic Network,
  • assign colors to each node based on the node’s type,
  • set node size to reflect the ‘popularity’ of Semantic Nodes (named entities),
  • merge related Semantic Nodes,
  • remove false-positive Semantic Nodes, and
  • hide Semantic Nodes that were mentioned by only a few Actor Nodes (either Facebook pages/groups or Instagram accounts).

These steps will help you reduce the clutter and complexity of the resulting visualization and, in turn, allow you to focus on the more salient features of the network.

1. Opening Communalytic’s Two-mode Semantic Network *.graphml file in Gephi

After downloading, installing and launching the Gephi software on your computer, click [File] -> [Open] in the main menu to open the *.GraphML file that you previously downloaded from Communalytic, as shown below.
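Before opening the file in Gephi, it can be useful to check what it contains. The following is a minimal sketch, assuming the export is a standard GraphML document with a node attribute named node_type (the attribute used later in this guide). The embedded sample document, its node ids and its Actor type strings are made up for illustration; a real Communalytic export may use different values.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy GraphML standing in for a Communalytic export (all ids and values are hypothetical).
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="node_type" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="a1"><data key="d0">ACTOR: page</data></node>
    <node id="a2"><data key="d0">ACTOR: group</data></node>
    <node id="s1"><data key="d0">NAMED ENTITY: ORG</data></node>
    <edge source="a1" target="s1"/>
    <edge source="a2" target="s1"/>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(xml_text):
    """Return (count of nodes per node_type, number of edges)."""
    root = ET.fromstring(xml_text)
    # Find the key id that GraphML assigned to the node_type attribute.
    key_id = next(
        k.get("id")
        for k in root.findall("g:key", NS)
        if k.get("for") == "node" and k.get("attr.name") == "node_type"
    )
    type_counts = Counter()
    for node in root.findall("g:graph/g:node", NS):
        for data in node.findall("g:data", NS):
            if data.get("key") == key_id:
                type_counts[data.text] += 1
    edge_count = len(root.findall("g:graph/g:edge", NS))
    return type_counts, edge_count


type_counts, edge_count = summarize_graphml(SAMPLE_GRAPHML)
for node_type, count in sorted(type_counts.items()):
    print(node_type, count)
print("edges:", edge_count)
```

To try this on a real export, read the downloaded file into a string and pass it to summarize_graphml in place of the sample.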

2. Choosing a Layout

After opening your GraphML file in Gephi, the next step is to select a layout algorithm in the [Overview] section of Gephi. There are many different layouts available. For smaller networks (n < 1000), we suggest using the [Fruchterman Reingold] layout.

If you are working with large networks, we suggest using [ForceAtlas 2], which can run on multiple threads to take advantage of the CPU cores/logical processors available on your computer and speed up the visualization process (in our case, we set the Thread number parameter to 7).

Once you select a layout, click on the [Run] button to start the visualization process. Some layouts will stop automatically after completing a pre-set number of iterations, but others will run until you click [Stop].

When the visualization process is done, take a look at the resulting network visualization to see how easy it is to interpret the network structure. This step is iterative: if you find that the resulting visualization is dense and difficult to decipher, try other layouts until you find one that suits your dataset. In our example, we tried two different layouts and found that Fruchterman Reingold produced an overall cleaner network with fewer overlapping nodes and edges, which makes it easier to examine the network’s different regions and nodes (see Figure below).

[Figure: the example network under the ForceAtlas 2 layout and the Fruchterman Reingold layout]

3. Changing Node Size

Next, to make the network visualization more intuitive and easier to read, you can change the node size for all of the Semantic Nodes (a.k.a. named entities) found in your dataset. This will make it easier to identify named entities that were mentioned more or less often than other named entities in your dataset. (See our tutorial “CrowdTangle:…(Part 2/3)” for more info about named entities.)

To do this, click on the [Appearance] tab -> [Nodes] tab -> [Ranking] tab, then select the [In-Degree] centrality and click [Apply]. Selecting the in-degree centrality measure in [Appearance] will cause Gephi to resize Semantic Nodes (named entities) in the network based on the total number of Actor Nodes (either Facebook pages/groups or Instagram accounts) that mentioned a particular Semantic Node. As a result, Semantic Nodes that represent frequently mentioned named entities in your dataset will appear larger in the network than named entities that were mentioned less often.

Alternatively, you can choose other SNA metrics to size nodes in the network. The [out-degree] centrality, for instance, will help to identify Actor Nodes (either Facebook pages/groups or Instagram accounts) that mentioned a variety of different named entities. Actor Nodes with high out-degree centrality values are usually actors who publish many posts, long posts, or both.

Another option is to use the number of followers each Actor Node (Facebook page/group or Instagram account) has, which is stored as part of your network data, to resize all Actor Nodes according to their popularity. To do so, select the [subscriberCount] attribute in the dropdown menu instead of in/out-degree centrality (as shown below). Note: Since Semantic Nodes don’t have “followers”, the size of their nodes will be set to the value of the Min size parameter. In our case, it is 5.
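To make the sizing logic in this step concrete, here is a minimal sketch that computes in-degree and out-degree from a toy edge list and maps in-degree onto node sizes. It assumes a simple linear mapping between a minimum and a maximum size; Gephi's own interpolation settings may differ. The edge list and the maximum size of 40 are hypothetical, while the minimum size of 5 matches the value used above.

```python
from collections import Counter

# Toy directed edges: (Actor Node, Semantic Node it mentioned). Hypothetical data.
edges = [
    ("Page A", "FDA"),
    ("Page B", "FDA"),
    ("Page B", "CDC"),
    ("Group C", "FDA"),
]

# In-degree: how many Actor Nodes mentioned each Semantic Node.
in_degree = Counter(target for _, target in edges)
# Out-degree: how many different named entities each Actor Node mentioned.
out_degree = Counter(source for source, _ in edges)


def rescale(values, min_size=5.0, max_size=40.0):
    """Linearly map each value onto the range [min_size, max_size]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values are equal, so every node gets the minimum size.
        return {name: min_size for name in values}
    scale = (max_size - min_size) / (hi - lo)
    return {name: min_size + (value - lo) * scale for name, value in values.items()}


# Include every node, so Actor Nodes (in-degree 0) receive the minimum size.
nodes = sorted({name for edge in edges for name in edge})
sizes = rescale({name: in_degree.get(name, 0) for name in nodes})

for name in nodes:
    print(name, round(sizes[name], 1))
```

The same rescale function could be applied to out-degree or to a follower count per Actor Node; nodes lacking the chosen value would then need a default, much as Semantic Nodes fall back to the Min size when sizing by [subscriberCount].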

4. Changing Node Color 

By default, Gephi will display all nodes in the same color (usually black). To differentiate node types by color, go to the [Appearance] tab -> [Nodes] tab -> [Partition] tab, select [node_type] from the dropdown, and click the [Apply] button.

In our case, Gephi assigned two different colors to Actor Nodes: Facebook groups (pink) and Facebook pages (black); and three different colors to Semantic Nodes: PERSON/personal names (blue), ORG/organizational names (orange), and LOC/locations (green). You can adjust the assigned colors by clicking on the colored square that appears in the panel under the [Partition] tab on the left. This panel also shows the percentage of nodes for each node type.

5. Displaying Node Labels 

To display node labels, simply click on the [T] icon at the bottom of the visualization as shown below. 

To reduce clutter caused by many overlapping labels, change the label size to reflect the [in-degree] centrality, similar to what we did in Step 3 above when we changed the node size. To do so, click on the [Appearance] tab -> [Nodes] tab -> [Ranking] tab, then click on the label size icon to indicate that you want to change the size of the labels.

Finally, select the [In-Degree] centrality from the dropdown menu and click [Apply].

6. Merging Related Semantic Nodes

When working with semantic networks, an important step is to merge Semantic Nodes that represent the same named entity but use different variations or spellings. 

To accomplish this step in Gephi, start by examining the most mentioned Semantic Nodes (these nodes will appear larger in the visualization) to see if any of them refer to the same person, organization, location, etc.

In our example, we might want to merge the FDA and US FDA nodes since both refer to the same government agency, the U.S. Food and Drug Administration (see below).

To merge two or more nodes that denote the same entity, go to the [Data Laboratory] tab and use the [Filter] option to search for and select [Nodes] with names containing a given keyword or string of characters; in our case, the keyword is FDA. When you run this search, Gephi will look for both Actor Nodes and Semantic Nodes that include “FDA” as part of their name (a.k.a. Label).

You can use the vertical bar character (|), usually located just above the [Enter] key on your keyboard, to search and filter by more than one keyword. In our case, by expanding our search to include “Food and Drug Administration” (final query: “FDA|Food and Drug Administration”), we found two additional Semantic Nodes that can be merged with FDA and US FDA (as shown below).
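The vertical bar works here the way alternation does in a regular expression: a label matches if it contains either keyword. A minimal sketch of the same query using Python's re module follows; the list of labels is hypothetical, and this sketch matches case-sensitively, which may differ from how Gephi's filter behaves.

```python
import re

# Hypothetical node labels, as they might appear in the Label column.
labels = [
    "FDA",
    "US FDA",
    "Food and Drug Administration",
    "US Food and Drug Administration",
    "CDC",
    "Bell",
]

# "A|B" matches any label that contains A or B.
pattern = re.compile(r"FDA|Food and Drug Administration")

matches = [label for label in labels if pattern.search(label)]
print(matches)
```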

Next, select all Semantic Nodes you want to merge by clicking on the corresponding rows in the Nodes table one at a time while holding down the [CTRL] key (or the Command key on Mac). Once selected, right-click and choose [Merge nodes].

Note: When merging nodes, only merge Semantic Nodes with other Semantic Nodes, and not with Actor Nodes. To confirm that a node represents a Semantic Node, look under the [node_type] column in the Nodes table; all Semantic Nodes will be characterized as a “NAMED ENTITY: xxx”.

The final step in this process is to select a label for the merged node. Using the drop down menu in the popup window that will appear after you click [Merge nodes], select the most suitable label for the merged node. A general rule is to select the most informative but shortest name so that it doesn’t cover many other nodes or other labels in the visualization. In our case, we chose “FDA”. 

After you click the [OK] button in the previous step, any Actor Nodes (Facebook pages/groups or Instagram accounts) that mentioned “FDA”, “US FDA”, “Food and Drug Administration” or “US Food and Drug Administration” will all be linked to a single Semantic Node labelled “FDA” (as shown below).
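Conceptually, merging redirects every edge that pointed at one of the variant nodes to the single merged node. The sketch below illustrates this on a hypothetical edge list. It collapses duplicate edges that arise when one actor mentioned several variants; how Gephi handles such parallel edges after a merge may differ.

```python
def merge_semantic_nodes(edges, variants, merged_label):
    """Redirect edges that target any variant label to merged_label.

    Only edge targets are rewritten, because in a Two-mode Semantic Network
    the Semantic Nodes are always on the receiving end of an edge.
    """
    merged = set()
    for source, target in edges:
        if target in variants:
            target = merged_label
        # Using a set drops duplicate (source, target) pairs.
        merged.add((source, target))
    return sorted(merged)


# Hypothetical edges: (Actor Node, Semantic Node it mentioned).
edges = [
    ("Page A", "FDA"),
    ("Page A", "US FDA"),
    ("Page B", "Food and Drug Administration"),
    ("Group C", "US Food and Drug Administration"),
    ("Group C", "CDC"),
]
variants = {
    "FDA",
    "US FDA",
    "Food and Drug Administration",
    "US Food and Drug Administration",
}

merged_edges = merge_semantic_nodes(edges, variants, "FDA")
for edge in merged_edges:
    print(edge)
```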

Since it may be impractical to complete the merging step for all Semantic Nodes in large networks, we advise focusing on the most mentioned entities. Step 8 below will show you how to do this. But before that, we will show you how to look for and remove false-positive Semantic Nodes.

7. Removing False-positive Semantic Nodes 

As noted in the previous tutorial, “CrowdTangle: Creating a Communalytic’s Two-mode Semantic Network with Facebook and Instagram Data (Part 2)”, the spaCy named entity recognition algorithms produce results that are around 85% accurate, depending on the language and domain of a dataset; as a result, some errors, known as false positives, are possible. To further improve the overall accuracy of the spaCy output, we suggest that users also manually review the Semantic Nodes present in the network to make sure that they in fact represent actual named entities, and remove those that do not. A common source of false positives is a message in your dataset written in a language that is not yet supported by spaCy.

To perform this data verification and cleaning process, follow the steps below: 

  1. Calculate the Degree centralities by clicking on the “Run” button in the Statistics tab.
  2. Sort the list of nodes by in-degree centrality from highest to lowest. To do this, click on the [Data Laboratory] tab -> [Nodes] tab and click on the header of the column called [In-Degree].
  3. Then, starting at the top of the list, manually review the values under the Label column in the nodes table to confirm that they represent named entities.
  4. If you come across a false-positive result, right-click anywhere in the corresponding row in the nodes table and select [Delete] in the context menu. The screenshot below shows how to delete an emoji of a person that was recognized as a named entity.
  5. Continue reviewing labels for all remaining Semantic Nodes. If you have an especially large network with over 1000 nodes, you can limit your review to the top 5-10% of the most mentioned nodes, since these are the nodes that you are most likely to use when examining and interpreting the final network visualization.

While potentially time-consuming, manually reviewing a sample of the most connected Semantic Nodes and removing false-positive Semantic Nodes (named entities) will improve the overall accuracy of the network and make it easier to interpret.

Note: Make sure to record your steps during this and other data preparation and cleaning steps as these are important details to include in your future publication. For example, you might want to record the total number of Semantic Nodes (named entities) that you reviewed and the percentage of Semantic Nodes (named entities) that you deemed as false-positive and removed from the network. 
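The review sample and the figures worth recording can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the labels, in-degree values and the single false positive are made up, and the 10% cut-off is the upper end of the range suggested above.

```python
import math


def top_percent(in_degree, percent):
    """Return the labels in the top `percent` of nodes, ranked by in-degree."""
    ranked = sorted(in_degree, key=in_degree.get, reverse=True)
    # Round up so that at least one node is always reviewed.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


def false_positive_percentage(reviewed, false_positives):
    """Share of reviewed nodes that were judged to be false positives."""
    return 100.0 * len(false_positives) / len(reviewed)


# Hypothetical Semantic Nodes: entity_1 has in-degree 20, entity_20 has in-degree 1.
in_degree = {f"entity_{i}": 21 - i for i in range(1, 21)}

reviewed = top_percent(in_degree, 10)
false_positives = ["entity_2"]  # pretend the manual review flagged this label

print("reviewed:", len(reviewed))
print("removed (%):", false_positive_percentage(reviewed, false_positives))
```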

8. Filtering Nodes Based on In-Degree Centrality

If your semantic network has a lot of named entities (which is often the case), you can reduce the clutter in the visualization by instructing Gephi to hide labels for Semantic Nodes with only one incoming connection (nodes with an in-degree centrality value of 1). (Note: If you have a very dense network, you might want to raise the in-degree centrality cut-off value iteratively until you get a network that is more manageable but still informative.)

As a reminder, in-degree centrality shows the number of incoming links into a node. In the context of Two-mode Semantic Networks, Actor Nodes (Facebook pages/groups or Instagram accounts) will always have in-degree centrality equal to 0, and Semantic Nodes (named entities) will have in-degree centrality equal to the number of Actor Nodes that mentioned them in your dataset.

To show labels only for Semantic Nodes (named entities) mentioned by more than one Actor Node (Facebook page/group or Instagram account), follow the steps below:

  1. Click on the [Filters] tab (usually located on the right side of the screen).
  2. Click to expand the Topology section.
  3. Drag & drop the [In Degree Range] filter under the Queries section below.
  4. A two-sided slider called [In Degree Range Settings] will appear, showing the minimum and maximum values for the “in-degree” centrality measure. Nodes with in-degree centrality values that fall within this range will be visible in the network visualization. By default, the minimum value will be set to 0 and the maximum will be set to the in-degree of the most mentioned named entity (or entities) in the network. In our case, it’s the word “Bell”, mentioned by 86 Facebook pages/groups. Change the minimum in-degree value to 2 by adjusting the slider.
  5. To apply the selected filter, click on the [Filter] button.
  6. Once applied, the filter will hide any nodes (and associated edges) with an in-degree centrality value less than 2. Because Actor Nodes have an in-degree value of 0, they will also be hidden from the visualization, as will any Semantic Nodes (named entities) that were mentioned by fewer than two Actor Nodes. Finally, click on the “[A]->” icon at the top of the [Filters] tab to hide labels for nodes that are not in the filtered network, and then click on the [Stop] button to disable the filter.
  7. After disabling the filter, all nodes and edges will become visible again, but only nodes with 2 or more incoming connections will have visible labels. By going through these data cleaning and filtering steps, we now have an uncluttered network that can be used to explore the connections between some of the more prevalent Actor Nodes and Semantic Nodes found in the dataset.
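The effect of the steps above can be sketched outside Gephi as follows: every node stays in the network, but only nodes whose in-degree meets the cut-off keep a visible label. The edge list is hypothetical, and the cut-off of 2 mirrors the slider value used in this step.

```python
from collections import Counter

# Hypothetical directed edges: (Actor Node, Semantic Node it mentioned).
edges = [
    ("Page A", "FDA"),
    ("Page B", "FDA"),
    ("Page B", "CDC"),
    ("Group C", "Bell"),
    ("Page A", "Bell"),
    ("Group C", "WHO"),
]

MIN_IN_DEGREE = 2  # same cut-off as the slider setting above

in_degree = Counter(target for _, target in edges)
actors = sorted({source for source, _ in edges})
nodes = sorted({name for edge in edges for name in edge})

# Actor Nodes only send edges, so their in-degree is always 0.
actor_in_degrees = [in_degree[actor] for actor in actors]

# All nodes remain in the network; only these keep a visible label.
labelled = [name for name in nodes if in_degree[name] >= MIN_IN_DEGREE]

print("nodes kept:", len(nodes))
print("labels shown:", labelled)
```

Raising MIN_IN_DEGREE corresponds to moving the slider's minimum value higher for denser networks.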