
Wave Function Collapse in Deep Convolutional Generative Adversarial Network

Procedural content generation algorithms are a well-known concept in the game industry. Because of their time efficiency, increasing effort is being put into developing new algorithms. The Wave Function Collapse algorithm, developed by Max Gumin, populates a pattern from a small sample. The algorithm gained much popularity because of the variety of outputs generated from only one input. This paper examines whether Wave Function Collapse can be trained with the use of a Deep Convolutional Generative Adversarial Network to produce an output that follows an input specified by the user. This would give additional control over the algorithm and allow the user to specify the spatial distribution of tiles across the solution space.
Keywords: Wave Function Collapse, Procedural City Generation, Machine Learning, Generative Adversarial Network
Table of contents
Abstract
1. Introduction
1.1 The aim of the project
1.2 The structure of the document
2. Background
2.1 Wave Function Collapse
2.2 Artificial Intelligence application to Wave Function Collapse
2.3 Algorithmic Complexity
3. Approach 1: Generative Adversarial Networks
3.1. Introduction
3.2. Dataset preparation
3.3. Image encoding
3.4. Image-to-image translation
3.5. Image post-processing
3.6. Evaluation of results
4. Approach 2: Modification of probability and entropy
4.1. Introduction
4.2. Input image
4.3. Result measurement
4.3.1. Linear mapping based on the lowest and the highest weight
4.3.2. Exponential growth
4.3.3. Enforcing the patterns building desirable content
4.4. Entropy
5. Results
6. Conclusion
7. Appendices
8. References
List of figures
Figure 1 Something something
Figure 2 On the left: Input image that indicates areas with specific desirable values, on the right: the solution that corresponds to the input image
Figure 3 Examples of the 2D patterns generated by Max Gumin with a use of WFC
Figure 4 Island generated with Wave Function Collapse in Bad North
Figure 5 3D representation of input image.
Figure 6 Input image to the WFC.
Figure 7 Patterns generated from input image with overlapping model.
Figure 8 On the left: The output of simple tiled model without rotation.
Figure 9 The result of PCGML
Figure 10 Pix2Pix Architecture
Figure 11 Image processing as the part of data preparation for machine learning
Figure 12 Result after first epoch with relu activation function
Figure 13 Result after first epoch with softmax activation function
Figure 14 On the left: input image, on the right: 96% accurate result
Figure 15 The process of content generation with wave function collapse, following the order of the minimal entropy.
Figure 16 The input image
Figure 17 The process of generating content with the order directed by values of the input image (Figure 16).
Procedural content generation (PCG) is a technique for algorithmically producing virtual content with minimal or indirect human input. Content in PCG refers to any virtual asset, such as game levels, rules, textures, characters, maps, music or architecture (Togelius, Shaker and Nelson 2016). Having emerged as a solution in computer graphics (Ebert, et al. 2003) and character animation (Wardrip-Fruin and Harrigan 2004), it has influenced the field of game design most significantly. ASCII-based games such as Beneath Apple Manor (Worth 1978) and Rogue (Toy, Wichman and Arnold 1980) are among the first known applications of PCG. In Maze Craze (Atari 1987), two players compete to escape a randomly created maze that is generated anew with every game level. The method became attractive because of its storage space efficiency, as limited storage was one of the factors hindering game development (Amato 2017). After almost four decades of technological advancement since the first PCG game release, today’s game industry products amaze with their complexity, level of detail and realistically programmed atmosphere. The multiplayer online game World of Warcraft (Blizzard Entertainment 2004) has over 1,400 different landscapes, 5,300 characters, 30,000 items and over 2,000,000 words of text. The production of those assets consumed five years of work (M. M. Hendrikx 2013). Technological progress has resolved the device capacity limitations, and the real challenge now is the time and resources consumed in producing high-quality games. The high cost of the process makes it impossible for small companies to develop a product independently using only manual methods of content generation. Therefore, applying procedural generators to games saves time, cost and resources, and provides a wide variation of generated content.
The reduced demand for human contribution makes the process more profitable and accessible for small and medium-sized enterprises, resulting in a higher diversity of published games. This is especially relevant because of the popularity of digital distribution services such as Google Play, Steam or the App Store, which have already made publishing easy and accessible to professionals and hobbyists alike. Procedural content generation automates the quantitative part of the process and shifts the workflow in favour of more qualitative content generation tasks.
These benefits have driven developers to research complex, universal generators that could potentially create an entire virtual world. A complete landscape with architecture, a set of items and complex rules could be the result of one procedural content generation algorithm. Four decades of research have not yet produced a solution that can generate complete game content. However, there has been significant progress in algorithms that target a specific content category rather than covering all of them. For example, L-systems have been applied as a road network generator with controllable parameters such as road patterns and population density (Parish and Müller 2011). L-systems and cellular automata are also conventional techniques for procedurally generated behaviour that simulates physics, such as firework explosions or the movement of background characters (Hidalgo, et al. 2008). The appeal of those methods lies in the simplicity of their logic combined with complex, unpredictable results. Although the rules are entirely deterministic, the outcomes differ widely depending on the initial state. An object following a few simple rules can behave similarly to real-world objects while using little computing power. The applications of L-systems and cellular automata range from the small scale, such as plant generation, to a larger one, such as simple 2D game levels.
Procedural level generation is one of the most researched applications of PCG algorithms (Hendrikx, et al. 2011). The challenge of level generation is to create variety between the levels, so that each level is not a monotonous repetition of the same pattern. The tools commonly used to break up repetitive content are functions that randomise values. Perlin Noise, invented by Ken Perlin (Musgrave, et al. 2003), is a method that generates natural-looking textures, which computer graphics designers can use to create content that resembles real-life textures, for example a realistic-looking sky (Roden and Parberry 2005). While functions like Perlin Noise successfully randomise the content, they are not necessarily applicable to games that require a complex environment with semantics. The techniques applied to generate advanced environments instead try to solve constraint satisfaction problems. Popular PCG methods that solve such problems are tile-based generators. The tiles are the ingredients of the content, and the constraints specify the relationships between the tiles.
Wave Function Collapse, developed by Max Gumin, has recently gained much popularity because its constraint-based solver generates many different results from a single sample.
Game creators benefit broadly from the use of procedural content generators, and the algorithms are commonly applied in the game industry. Despite their popularity, they still face many challenges. Generated content usually looks generic and needs a purposeful macro-structure. Generated levels differ from each other, but after a few iterations the player can easily notice the similarities. The lack of progression between the levels decreases the value of the game. A commonly used concept for game progression is levels that evolve towards broader and more complex content as the player progresses. To design that kind of progression, the creator needs more control over the algorithm. Procedurally generated games are usually not flexible to any additional input besides the set of rules. Meaningful and original content creation needs human input. Therefore, the generators should be able to respond to additional instructions added on top of the set of rules.
This dissertation explores techniques that could enable an artist to specify the spatial features of the game content and generate a result that satisfies the constraints.
1.1. The aim
This dissertation aims to explore two approaches to generating a Wave Function Collapse result that meets additional requirements specified by the designer. In addition to the standard algorithm inputs, there is an image that indicates the areas of desirable content. In the game context, that could be a river, an island or a building that follows the desired distribution. In the presented example, the desirable content (shown as the black area in the input image) corresponds to the non-white cells. Adding these requirements on top of Wave Function Collapse would make full use of the benefits of the algorithm while enriching the result with a meaningful macro-structure. Controlling how the generated space looks enables the artist to create content of varied complexity and adjust the architecture to specific level requirements.
Both techniques aim to use a simple image as an indicator of where assets should appear. The result should also keep the Wave Function Collapse constraints. The first approach uses Generative Adversarial Networks, and the second approach modifies the probability functions inside Wave Function Collapse.
Figure 1. Input image.
Figure 2. Result of the Wave Function Collapse.
1.2. The structure of the document
The first chapter introduces the concept and advancements of procedural content generators and explains the current challenges. The background section describes the Wave Function Collapse algorithm, which is the basis for this dissertation, followed by the research relevant to this topic. The methodology is separated into two chapters, as this dissertation explores two methods of achieving the aim. The third chapter explains the machine learning approach, and the fourth chapter focuses on the second method, which works with the probabilities and the entropy. The methodology of both approaches is followed by a brief description of the evaluation methods and the results from both methods. The overview and assessment precede the conclusion.
2.1. Introduction to Wave Function Collapse
Wave Function Collapse (WFC) is a procedural content generation algorithm developed by Max Gumin in 2015. The name refers to the process in quantum mechanics in which a wave function in a superposition of states reduces to a single state when it is observed. By analogy, the high entropy of the unobserved state decreases as more of the system is observed. Once the state is fully observed and the entropy is equal to zero, the wave function has collapsed.
Wave Function Collapse is a tile-based content generator. The input is a sample of the pattern that needs to be populated. The set of tiles is built from the tessellated pattern sample. Each tile is assigned the set of tiles that may appear next to it. The algorithm is a constraint-based solver, which does not allow any tile connection that does not match the neighbour settings.
The first application of WFC was a set of 2D patterns generated by the author himself (Gumin 2015). The examples presented applications of the pattern in different settings, such as city elevations, brick layouts or abstract patterns.
Figure 3. 2D patterns generated with Wave Function Collapse by Max Gumin.
The publication received much attention on social media from game and computer graphics artists. Gumin presented the inputs and generated outputs, together with videos showing how Wave Function Collapse solves the space. During pattern generation, the algorithm follows the minimal entropy heuristic. People also instinctively perform many activities, such as drawing, with a minimal entropy heuristic. That is why the process of solving the pattern is enjoyable to watch (Gumin 2015).
WFC can be applied to both two- and three-dimensional space. In 2D, one tile can be a pixel, an image or a 2D surface. In 3D, a tile can be a voxel or a 3D model. Marian Kleineberg adapted WFC to create an infinite city assembled from 3D blocks such as bridges, buildings and streets. The content continues to generate in any direction, so a user walking through the virtual city never reaches its end (Kleineberg, 2018). One of the first games generated with the use of Wave Function Collapse is Proc Skater 2016 (Parker, Jones and Morante 2016). It is a skateboarding game in which a player can enjoy numerous procedurally generated skate parks and save a favourite configuration. The real-time strategy video game Bad North (Stålberg and Meredith 2018) uses WFC in three-dimensional space to generate the islands that serve as game levels. This game attracted much attention not only for its use of WFC, but also for its outstanding aesthetics.
Figure 4. Island generated with Wave Function Collapse in the video game Bad North.
Since the first publication of the algorithm, the interest coming from both hobbyists and professionals led to numerous applications and modifications of the WFC. It also became a topic of academic research in the field of game design and procedural content generation.
2.2. Wave Function Collapse explained
Wave Function Collapse is the starting point of this dissertation. The algorithm was recreated based on the first academic publication concerning WFC (Karth and Smith 2017). Karth and Smith describe the history and applications of the method, followed by an explanation of each step with pseudocode that makes it possible to understand and recreate the algorithm. Understanding the algorithm is essential to follow the rest of this dissertation. Therefore, this section explains each step of Wave Function Collapse using the example that is also the input pattern for the experiments.
Input pattern sample
The input pattern is a sample of the content to be populated. It could be a bitmap where one cell is one pixel, or a grid of elements where one cell is one 2D or 3D asset. The image used for this dissertation is a simple 2D image built from three types of components. The logic of WFC is the same for every level of complexity; however, a simpler pattern is faster to solve. The pattern sample is a five-by-five grid filled with three types of components. In this particular example, the shape of the components is the same, and they differ only in colour. For clear referencing in this document, each component (also called a cell) has a letter code that is used in the further parts of this dissertation.

Figure 6. Pattern sample encoded as letters.
Figure 5. Pattern sample.

Library of patterns from pattern sample
Gumin proposes two models for generating the pattern (also called tile) library: the simple tiled model and the overlapping model. The simple tiled model tessellates the input pattern into a grid of NxN blocks, where N is commonly 2 or 3, and each block of the grid becomes one pattern. The overlapping model extracts all possible NxN patterns from the input. A library generated with the overlapping model will therefore have more members than one generated with the simple tiled model. For both models, the patterns can be reflected and rotated in order to build a more extensive library. The simple tiled model is more computationally efficient in the later stages of the algorithm. The overlapping model generates a more extensive library of patterns, which results in a more diversified output. The choice of model matters most when dealing with simple input patterns. For intricate input patterns, the result is diverse with both models. With a simple pattern sample, the simple tiled model generates a small library of patterns, and the outcome of Wave Function Collapse can be less attractive.
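The extraction step of the overlapping model can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names (`rotate`, `reflect`, `extract_patterns`) and the small sample grid are assumptions introduced for the example, and the sample is not wrapped around at its borders.

```python
def rotate(pattern):
    """Rotate a square pattern (tuple of tuples) by 90 degrees clockwise."""
    return tuple(zip(*pattern[::-1]))


def reflect(pattern):
    """Mirror a pattern left to right."""
    return tuple(row[::-1] for row in pattern)


def extract_patterns(sample, n=2, symmetries=True):
    """Collect every n x n window of `sample` (overlapping model).

    `sample` is a list of equally long strings, one letter per cell.
    With `symmetries=True`, the rotations and reflections of each window
    are added to the library as well.
    """
    height, width = len(sample), len(sample[0])
    library = set()
    for y in range(height - n + 1):
        for x in range(width - n + 1):
            window = tuple(tuple(sample[y + dy][x:x + n]) for dy in range(n))
            variants = [window]
            if symmetries:
                for _ in range(3):
                    variants.append(rotate(variants[-1]))
                variants += [reflect(v) for v in variants]
            library.update(variants)
    return library


# Hypothetical 3 x 3 sample: a single B cell surrounded by A cells.
patterns = extract_patterns(["AAA", "ABA", "AAA"], n=2)
```

For this small sample the four windows are already rotations of each other, so adding symmetries does not enlarge the library; with less regular samples it usually does.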
Because of the simplicity of the input pattern sample used in this work, the library of tiles is generated using the overlapping model.
Figure 7. Library of patterns generated with overlapping model and rotation.

Figure 9. WFC result built from the overlapping model with rotation library.
Figure 8. WFC result built from the simple tiled model patterns library.
Each tile from the generated library has an assigned weight. The weight corresponds to the probability of this pattern appearing in the solution. The weight of a pattern is the sum of the weights of the components that build it. A component’s weight is the fraction of the sample image that this component fills. For example, the sample pattern presented in the previous section (Figure 5) consists of 52% A components, 20% B components and 28% C components.
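The weight calculation can be illustrated with a short sketch. The 5 x 5 layout below is hypothetical: only the proportions (13 A, 5 B and 7 C cells, which give 52%, 20% and 28%) follow the text, and the function names are assumptions made for this example.

```python
from collections import Counter


def component_weights(sample):
    """Fraction of the sample filled by each component letter."""
    cells = [cell for row in sample for cell in row]
    counts = Counter(cells)
    return {letter: count / len(cells) for letter, count in counts.items()}


def pattern_weight(pattern, weights):
    """Weight of a pattern: the sum of the weights of its components."""
    return sum(weights[cell] for row in pattern for cell in row)


# Hypothetical layout with 13 x A, 5 x B and 7 x C.
SAMPLE = [
    "AAAAA",
    "ABCBA",
    "ACCCA",
    "ABCBA",
    "ABCCA",
]

weights = component_weights(SAMPLE)          # A: 0.52, B: 0.20, C: 0.28
all_white = pattern_weight(["AA", "AA"], weights)   # 4 * 0.52
mixed = pattern_weight(["AB", "CC"], weights)       # 0.52 + 0.20 + 2 * 0.28
```

Under this definition a 2 x 2 pattern built only from white A cells carries the highest weight, which matters for the weight modifications discussed later.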

Table 1. Weights of the patterns.
Overlapping neighbours
The next step is to assign to each pattern the set of possible neighbours that can appear next to it. For the overlapping model, each tile has $(2(N-1)+1)^2$ offsets to consider.
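A short sketch can make the offset count concrete. It enumerates the $(2(N-1)+1)^2$ offsets and checks whether two patterns agree on their overlapping cells. The function names and the two-letter patterns are assumptions made for the example, not part of the original implementation.

```python
def overlap_offsets(n):
    """All (dx, dy) shifts at which two n x n patterns still overlap."""
    reach = n - 1
    return [(dx, dy)
            for dy in range(-reach, reach + 1)
            for dx in range(-reach, reach + 1)]


def compatible(p, q, dx, dy):
    """True if q, shifted by (dx, dy) relative to p, matches p on every shared cell."""
    n = len(p)
    for y in range(n):
        for x in range(n):
            qx, qy = x - dx, y - dy  # coordinates of the same cell inside q
            if 0 <= qx < n and 0 <= qy < n and p[y][x] != q[qy][qx]:
                return False
    return True


p = ["AB", "AA"]
q = ["BA", "AA"]
n_offsets_2 = len(overlap_offsets(2))   # (2 * 1 + 1) ** 2
n_offsets_3 = len(overlap_offsets(3))   # (2 * 2 + 1) ** 2
```

The offset (0, 0) is included in the count, where a pattern is only compatible with itself.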
Each cell in the space that needs to be solved has its own entropy. The entropy grows with the number of patterns that still satisfy the constraints in this location. As more patterns are placed in the solution, fewer candidates remain and the entropy decreases. Within WFC the entropy of a cell can therefore decrease but never increase and, as in the quantum mechanical concept of information, the entropy of a pure state is equal to zero (Nielsen and Chuang 2000). The values are calculated from Shannon’s entropy equation (Shannon 1948):

$$H = -\sum_{i} p_i \log p_i$$

where $p_i$ is the weight of pattern $i$, normalised over the patterns that are still possible in the cell. That means that a cell with only a few possible patterns will have a smaller entropy than a cell where many tiles satisfy the constraints.
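As a small illustration of the entropy calculation, the sketch below computes Shannon's entropy for the weights of the patterns still possible in one cell. The function name and the normalisation of the weights into probabilities are assumptions of this sketch.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (natural log) of a list of non-negative pattern weights."""
    total = sum(weights)
    probabilities = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probabilities)


many_options = shannon_entropy([1.0, 1.0, 1.0, 1.0])  # four equally likely patterns
few_options = shannon_entropy([1.0, 1.0])             # two equally likely patterns
collapsed = shannon_entropy([1.0])                    # a single remaining pattern
```

A cell with a single remaining pattern has zero entropy, which corresponds to the observed state.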

Once each tile in the generated set has been assigned a list of possible neighbours and a weight, the next step is to select the first location to collapse. At this point, the entropy is equally high in every cell. Therefore, the first location is usually chosen at random or specified by the user. In every following iteration, the cell selected for the new pattern is the one with the lowest entropy. Once the location is defined, the next step is to pick which pattern will appear in this place. Patterns with higher weights have a higher probability of being selected.
sum = sum of all weights in possible_patterns
random = random number from zero to sum
current_sum = 0
for every pattern in possible_patterns:
    current_sum = current_sum + weight of current pattern
    if current_sum > random:
        return this pattern as selected
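The selection step above translates directly into Python. The sketch below is one possible rendering; the function name, the `weights` dictionary and the `rng` parameter are assumptions introduced for the example.

```python
import random


def select_pattern(possible_patterns, weights, rng=random):
    """Pick one pattern with probability proportional to its weight."""
    total = sum(weights[p] for p in possible_patterns)
    threshold = rng.uniform(0, total)
    running = 0.0
    for pattern in possible_patterns:
        running += weights[pattern]
        if running > threshold:
            return pattern
    # Guard for the rare case where rounding leaves threshold == total.
    return possible_patterns[-1]


rng = random.Random(1)
example_weights = {"light": 0.9, "dark": 0.1}
picks = [select_pattern(["light", "dark"], example_weights, rng) for _ in range(2000)]
```

The standard library function `random.choices(population, weights=...)` performs the same weighted draw and could replace the explicit loop.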

Once a pattern has been selected for the chosen location, the next step is to update the entropies. The entropy of the selected location drops to zero, because it is now an observed state. Then, for each of the overlapping positions, the entropy needs to be recalculated, as the patterns that cannot be neighbours of the newly selected pattern need to be blocked. Therefore, the entropy of those locations also decreases.
The observation and propagation steps repeat until the whole space is solved.
With every iteration, more patterns get blocked, and the entropy decreases. It may happen that, in a location where no pattern has been assigned yet, the number of possible patterns becomes zero. This circumstance is called a contradiction. In Gumin’s version of Wave Function Collapse, if the process reaches this point, it resets and starts from the beginning. In later adaptations of the algorithm, backtracking makes it possible to ‘erase’ the step that caused the contradiction so that a different pattern can be assigned. Backtracking does not solve the problem of contradictions completely; however, in many cases, it reduces it significantly. Wave Function Collapse has known scalability limitations (Scurti and Verbrugge 2018): more complex input patterns and larger outputs rapidly increase the computing time, mainly because of the growing number of contradictions.
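The observation and propagation loop, including the restart on contradiction, can be sketched as below. This is a deliberately simplified illustration and not the implementation used in this work: it uses single-cell tiles with a direction-independent adjacency set (`allowed`) instead of the overlapping model, it restarts rather than backtracks, and all names are assumptions made for the example.

```python
import math
import random


def entropy(options, weights):
    """Shannon entropy of the tiles still possible in one cell."""
    total = sum(weights[t] for t in options)
    return -sum((weights[t] / total) * math.log(weights[t] / total) for t in options)


def run_once(width, height, weights, allowed, rng):
    """One attempt: returns {(x, y): tile}, or None if a contradiction occurs."""
    grid = {(x, y): set(weights) for x in range(width) for y in range(height)}
    while True:
        open_cells = [c for c, opts in grid.items() if len(opts) > 1]
        if not open_cells:
            return {c: next(iter(opts)) for c, opts in grid.items()}
        # Observation: lowest-entropy cell first, ties broken at random.
        cell = min(open_cells, key=lambda c: (entropy(grid[c], weights), rng.random()))
        options = sorted(grid[cell])
        grid[cell] = {rng.choices(options, weights=[weights[t] for t in options])[0]}
        # Propagation: remove tiles that no longer have a compatible neighbour.
        stack = [cell]
        while stack:
            x, y = stack.pop()
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb not in grid:
                    continue
                keep = {t for t in grid[nb]
                        if any((s, t) in allowed for s in grid[(x, y)])}
                if not keep:
                    return None  # contradiction
                if keep != grid[nb]:
                    grid[nb] = keep
                    stack.append(nb)


def wfc(width, height, weights, allowed, seed=0, max_attempts=100):
    """Restart from scratch after each contradiction, as in Gumin's version."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        result = run_once(width, height, weights, allowed, rng)
        if result is not None:
            return result
    raise RuntimeError("no solution found within max_attempts")


# Hypothetical rules: A and C may never touch; B may touch everything.
pairs = {("A", "A"), ("A", "B"), ("B", "B"), ("B", "C"), ("C", "C")}
allowed = pairs | {(b, a) for a, b in pairs}
solution = wfc(6, 6, {"A": 0.52, "B": 0.20, "C": 0.28}, allowed, seed=3)
```

With these example rules a contradiction cannot occur, because B is compatible with every tile; with stricter rule sets the restart loop is what keeps the sketch from returning an invalid grid.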
The guidelines described above are the underlying implementation of Wave Function Collapse with one additional element: backtracking. However, the algorithm is often modified to adjust its functionality to a specific problem.
2.3. Machine learning and Wave Function Collapse
With the growing interest in PCG algorithms such as Wave Function Collapse, researchers are exploring new methods for creating valuable content with minimal human assistance. Procedural Content Generation through Machine Learning (PCGML) is a novel concept of game content creation using machine learning models trained on existing content (Summerville, et al. 2017). Recent research in PCGML concentrates on reproducing game assets in order to provide numerous variations of the virtual environment, trained on previous examples. However, the more design effort is invested in delivering high-quality training data, the lower the payoff of applying PCGML in the first place. Karth and Smith propose adding a discriminative model to a modified version of Wave Function Collapse, where the model learns to assess whether generated content is acceptable. Both negative and positive examples of generated patterns are fed to the discriminator, paired with WFC examples (Karth and Smith 2018). By incrementally adding inputs (Gumin’s WFC allows only one input), the artist feeds the algorithm with examples of patterns that did not appear in the sample pattern but would be a welcome variation on it. This technique requires more human contribution than the original WFC model; however, it encourages the artist to change the pattern by adding new features rather than by recreating the sample pattern.

Figure 10. Karth and Smith’s example of the WFC combined with the discriminator fed by positive and negative examples.

2.4. Controllable procedural content generation
I found a paper, “Controllable Procedural Content Generation via Constrained Multi-Dimensional Markov Chain Sampling”. It is not exactly Wave Function Collapse, but it may be relevant, so I may write about it briefly.
2.5. Algorithmic complexity
Not sure if this is still relevant, but maybe a short paragraph explaining that the problem is combinatorial and therefore very complex and constrained, which makes it hard to make the algorithm flexible.
Method I: Generative Adversarial Networks
3.1. Introduction
The first proposal for generating a Wave Function Collapse result that responds to an additional input image is to train a machine learning model on collected WFC results. The model should be supplied with both the input image and the corresponding WFC outcome, so that it can learn the relationship between them.
A Generative Adversarial Network (GAN) is a machine learning model that concurrently trains two neural networks: a generative and a discriminative one (Goodfellow, et al. 2014). The generative model (called the Generator) learns the data distribution and attempts to produce data resembling the training dataset. Simultaneously, the discriminative model (called the Discriminator) learns to distinguish whether the data comes from the generator or from the actual dataset. GAN, described as the most exciting idea in the last decade of machine learning (LeCun 2016), received much attention from researchers, which led to multiple variations of the model. Image-to-image translation, in addition to learning the mapping from the input image to the output image, also learns a loss function to train this mapping (Isola, et al. 2016).
Figure 11. Two examples presented in (Goodfellow, et al. 2014). The right column shows the results of the generated image after the training.
Describe the architecture of image-to-image translation
3.2. Dataset preparation
The training, testing and validation data for the image-to-image translation are Wave Function Collapse outcomes. The dataset contains 1200 pairs for training and 400 for testing and validation. Image-to-image translation can produce satisfactory results with a small dataset of around 400 images. However, the type of images used in this example differs from commonly used datasets. Each item in the dataset is a pair of an input image and the desirable WFC outcome (see Figures 1 and 2). The black and white input image is a processed Wave Function Collapse result image. The first step replaces components B and C that are surrounded only by white A components with A. Then, a Gaussian blur masks the details of the image and extracts the macro-structure of the pattern. The last step is a threshold that converts pixels with a brightness below 0.7 to black and pixels brighter than this threshold to white.
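The three processing steps can be sketched in plain Python as follows. This is only an illustration of the pipeline described above: the grey levels assigned to B and C, the use of the eight surrounding cells as the neighbourhood, and the blur parameters (`sigma`, `radius`) are assumptions of this sketch, while the 0.7 threshold comes from the text.

```python
import math

# Assumed grey levels; the text only states that A is white.
BRIGHTNESS = {"A": 1.0, "B": 0.5, "C": 0.0}


def remove_isolated(grid):
    """Replace B/C cells whose surrounding cells are all A with A."""
    h, w = len(grid), len(grid[0])
    out = [list(row) for row in grid]
    for y in range(h):
        for x in range(w):
            if grid[y][x] == "A":
                continue
            around = [grid[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))
                      if (ny, nx) != (y, x)]
            if all(cell == "A" for cell in around):
                out[y][x] = "A"
    return out


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur with edge pixels repeated at the border."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(o * o) / (2 * sigma * sigma)) for o in offsets]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]
    h, w = len(img), len(img[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    horizontal = [[sum(k * img[y][clamp(x + o, w)] for o, k in zip(offsets, kernel))
                   for x in range(w)] for y in range(h)]
    return [[sum(k * horizontal[clamp(y + o, h)][x] for o, k in zip(offsets, kernel))
             for x in range(w)] for y in range(h)]


def to_input_mask(grid, threshold=0.7):
    """WFC result (letters) -> black (0) and white (1) input image."""
    cleaned = remove_isolated(grid)
    grey = [[BRIGHTNESS[cell] for cell in row] for row in cleaned]
    blurred = gaussian_blur(grey)
    return [[0 if value < threshold else 1 for value in row] for row in blurred]


lonely = ["AAAAA", "AAAAA", "AACAA", "AAAAA", "AAAAA"]
mask_lonely = to_input_mask(lonely)        # the single C cell is removed first
mask_dark = to_input_mask(["CCC", "CCC", "CCC"])
```

In practice an image library would normally provide the blur; the hand-written version is used here only to keep the sketch self-contained.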

Figure 13 Image processing as the part of data preparation for machine learning

3.3. Image encoding
The images typically used to train GAN models are photographs or drawings. In such images, changing values between neighbouring pixels form gradients, which usually indicate an edge, for example of a shadow. In the case of the WFC output, the pixels have a different type of relationship and should not be considered a gradient. Therefore, to differentiate the values, rather than assigning RGB or brightness values, each pixel is converted to a one-hot state. Explain one-hot encoding.
The pattern has three different values. Each value has its own combination of a single 1 and zeros that is unique to its colour.

[ 1, 0, 0 ] [ 0, 1, 0 ] [ 0, 0, 1 ]
The one-hot encoded pattern is an array of four vectors, one per cell, each corresponding to the colour value of that cell:
[ [ 0, 0, 1 ] , [ 0, 1, 0 ],
[ 0, 1, 0 ] , [ 1, 0, 0 ] ]
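A minimal sketch of this encoding is shown below. The assignment of the three codes to the letters A, B and C is an assumption made for the example, as the text does not state which code belongs to which colour; the decoder that picks the largest component is likewise only illustrative.

```python
# Assumed mapping between cell letters and one-hot codes.
ONE_HOT = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
LETTERS = "ABC"


def one_hot_encode(pattern):
    """Rows of letters -> rows of one-hot vectors."""
    return [[ONE_HOT[cell] for cell in row] for row in pattern]


def one_hot_decode(encoded):
    """Rows of (possibly soft) vectors -> rows of letters, taking the largest entry."""
    return ["".join(LETTERS[max(range(3), key=lambda i: vector[i])] for vector in row)
            for row in encoded]


encoded = one_hot_encode(["CB", "BA"])
decoded = one_hot_decode(encoded)
soft = one_hot_decode([[[0.1, 0.7, 0.2]]])  # e.g. a softmax-like network output
```

Under the assumed mapping, the 2 x 2 pattern with rows CB and BA produces the array shown above.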
The colour value encoding is followed by information about the cell’s neighbours. That means that six values represent one cell: the first three are colour codes, and the last three are the sums of
3.4. Image-to-image translation
The pix2pix model is implemented with the TensorFlow libraries.
Generator architecture (especially downsampling, upsampling, and activation functions)
Discriminator architecture (same as in generator)
Method II: Probability and entropy
4.1. Introduction
The second approach aims to generate a Wave Function Collapse pattern that reflects the geometry of the input image while following the steps of the original algorithm. As opposed to the first method, this proposal focuses on fully satisfying the constraints. Therefore, the core logic remains the same, and the focus is on modifying the weights of the patterns and the entropies. By increasing the weights of desirable components in the areas corresponding to the black pixels in the input image, the probability of WFC choosing those patterns increases. This approach presents three different functions for weight recalculation, each tested on a set of different values.
4.2. Input image
The inputs to the algorithm are the pattern sample (the standard input to Wave Function Collapse, see Figure x) and a second input, which is a black and white image (see Figure x). The input image describes the macro-structure of the desirable WFC output, where black pixels correspond to the cells B and C (the darker colours).
4.3. Linear mapping of the probabilities
The first approach relies heavily on the original weights calculated from the pattern sample. The weights of the cells (A, B and C) are exchanged, so that the highest weight, previously assigned to the white cell A, becomes the weight of cell C (the darkest one). Tile B has medium brightness, and its weight stays the same.
The structure of the tiles, however, is different. In the original set, the highest-weighted tile was built from four white components. With the new weights this highest value cannot be reached, because the library contains no pattern built from four dark green cells. To keep the same domain of values, each new pattern weight is mapped into the original weight domain.
The initial step in recalculating the weights was to reverse the existing proportions. The highest weight, initially assigned to the white cell (0.52), is now assigned to the dark green cell. The middle weight (0.28) goes to the light green cell and the lowest (0.20) to the white cell.
In the table (see Table 2), method I lists the values obtained by reversing the weights, and method II presents the same weights mapped into the original pattern weight domain.
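The mapping into the original weight domain can be read as a linear rescaling between the lowest and the highest pattern weight. The sketch below illustrates that reading; the four patterns in the library are hypothetical, and the swapped weights follow the exchange of A and C with B unchanged, so the numbers are illustrative rather than the values of Table 2.

```python
def pattern_weight(pattern, weights):
    """Sum of the component weights of a pattern given as a string of letters."""
    return sum(weights[cell] for cell in pattern)


def remap(values, low, high):
    """Linearly rescale `values` so that their minimum maps to `low` and maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


original = {"A": 0.52, "B": 0.20, "C": 0.28}
swapped = {"A": 0.28, "B": 0.20, "C": 0.52}   # A and C exchanged, B unchanged

library = ["AAAA", "AAAB", "ABCC", "BCCC"]    # hypothetical 2 x 2 patterns, flattened

old = [pattern_weight(p, original) for p in library]   # 2.08, 1.76, 1.28, 1.04
new = [pattern_weight(p, swapped) for p in library]    # 1.12, 1.04, 1.52, 1.76
mapped = remap(new, min(old), max(old))
```

After the rescaling the darkest pattern in this small library receives the weight that the all-white pattern had originally, while the ordering produced by the swapped weights is preserved.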

Table 2 Comparison of the original weights of patterns, and the modified according to the method 1.
Because the domain stays the same and the proportions of the weights correspond to the original values, this results in an output that is as random as the one generated with the weights calculated from the input image.
The constraints in the algorithm outweigh the changed weights. Even when the dark cells are given a higher probability, the structure of the input image and the dominance of white cells determine the set of tiles that is generated in the first place. Therefore, changing the weights within the same domain is not effective, because the generated set of tiles and the natural proportions of the input image predefine the pattern.
For that reason, the weights were recalculated with the weights of the green cells reinforced, in order to measure the result based on those weights. The tested sets of weights iteratively increase the light green and dark green values and decrease the value of [0].

Table 3 The patterns weights based on different cell colour weight values
4.3.1. Exponential growth
The second approach reinforces the green cells exponentially rather than linearly. To push the algorithm to place the darker cells, the pattern weights are calculated exponentially, so that the difference between the weight of a white pattern and that of a dark pattern becomes significantly larger.
if the sum of cells [1] and [2] in the pattern is:
    1: return pattern_weight
    2: return pattern_weight^2
    3: return pattern_weight^3
    4: return pattern_weight^4
The same sets of weights, recalculated with exponential growth that rewards tiles with less white space, result in a higher range of output values.

4.3.2. Enforcing the patterns building desirable content
The third approach looks closely at the structure of the regions in the solution pattern that are filled with coloured cells. Using the dataset prepared for the machine learning approach, the occurrence of each pattern in the extracted area is the value that feeds this approach.
run wave function collapse
process the image
get the dark area of the image
calculate pattern occurrence in the dark area

Table 4 Pattern occurrence in the areas that are marked black in the input image. The chart visualises 30 iterations and clearly shows which patterns tend to be components of a successful run.
The results of the pattern occurrence show a clear selection preference. The average over one hundred iterations is used as the multiplier for the pattern weights.
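The counting and averaging steps can be sketched as below. The sketch counts only the patterns that lie entirely inside the dark area of the processed image and averages the counts over several runs; the function names, the 0/1 mask convention (0 for black) and the tiny example grid are assumptions made for illustration.

```python
from collections import Counter


def occurrences_in_mask(grid, mask, n=2):
    """Count the n x n patterns of `grid` whose cells all lie on black (0) mask pixels."""
    counts = Counter()
    height, width = len(grid), len(grid[0])
    for y in range(height - n + 1):
        for x in range(width - n + 1):
            inside = all(mask[y + dy][x + dx] == 0
                         for dy in range(n) for dx in range(n))
            if inside:
                counts[tuple(grid[y + dy][x:x + n] for dy in range(n))] += 1
    return counts


def average_occurrence(runs, n=2):
    """Average occurrence per run of each pattern over (grid, mask) pairs."""
    total = Counter()
    for grid, mask in runs:
        total.update(occurrences_in_mask(grid, mask, n))
    return {pattern: count / len(runs) for pattern, count in total.items()}


# Hypothetical WFC result and its processed black and white image.
grid = ["ABC", "ACC", "ACC"]
mask = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
counts = occurrences_in_mask(grid, mask)
averages = average_occurrence([(grid, mask), (grid, mask)])
```

The resulting averages could then be multiplied with the pattern weights; how patterns that never occur in the dark area should be treated is left open here.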

Table 5 The average number of pattern occurrence in the dark areas of the image.
4.4. Entropy
The entropy, which is high at the beginning of solving the output pattern, gradually decreases as successive patterns collapse. After each collapsed pattern, the location with the lowest entropy that has no assigned pattern is selected as the next location to be solved. Because of the constraints and the entropy order, by the time the algorithm naturally reaches the areas marked as black, the patterns that can be placed there while keeping the constraints satisfied are usually already partly determined by the collapsed patterns. To increase the chance of solving the space with desirable patterns, the cells corresponding to the black space are given priority and are solved first. The benefit of this depends strongly on the complexity of the input image. For more complex shapes, the forced order of collapsing patterns more often results in contradictions.

Figure 16 The process of content generation with wave function collapse, following the minimum entropy heuristic.

Figure 17 The input image
Figure 18 The process of generating content with the order directed by values of the input image (Figure 17).
The natural order of collapsing patterns depends on the structure of the built space. The algorithm first solves the areas with non-white space, because those tiles are the most constrained. When an input image is given, the algorithm still prioritizes the lowest-entropy locations, first within the black area of the image and then in the areas outside it.
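The two selection orders can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the project: each cell is assumed to hold a set of still-legal pattern indices, entropy is computed as Shannon entropy over the pattern weights, and the names `shannon_entropy`, `next_cell` and `priority_mask` are hypothetical. Contradictions (cells with no legal pattern left) are not handled here.

```python
import math

def shannon_entropy(weights):
    """Shannon entropy of a cell from the weights of its still-legal patterns."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)

def next_cell(options, weights, priority_mask=None):
    """Pick the uncollapsed cell with the lowest entropy.

    options: dict mapping (row, col) -> set of still-legal pattern indices
    weights: list of pattern weights indexed by pattern
    priority_mask: optional set of (row, col) cells that must be solved first
    """
    open_cells = [c for c, opts in options.items() if len(opts) > 1]
    if not open_cells:
        return None  # every cell is already collapsed
    if priority_mask:
        prioritised = [c for c in open_cells if c in priority_mask]
        if prioritised:
            open_cells = prioritised  # black-area cells go first
    return min(open_cells,
               key=lambda c: shannon_entropy([weights[p] for p in options[c]]))

# Toy state: three patterns, four cells, one of them already collapsed.
weights = [0.52, 0.20, 0.28]
options = {(0, 0): {0, 1}, (0, 1): {0, 1, 2}, (1, 0): {2}, (1, 1): {1, 2}}
natural = next_cell(options, weights)
directed = next_cell(options, weights, priority_mask={(0, 1), (1, 1)})
```

Without a mask the lowest-entropy cell overall is chosen; with a mask the choice is restricted to the masked cells until none of them remains open.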
4.5. Result measurement
Figure 15 On the left: input image, on the right: 96% accurate result
For 129 cells:
36 x [0] (white)
49 x [1] (light green)
44 x [2] (dark green)
For this example, the success rate is: (49 + 44)/129 = 0.72. Considering that the maximum that can be obtained is 0.75, this model is 96% accurate.
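The arithmetic of this example can be checked with a few lines of Python. The cell counts and the maximum of 0.75 come from the text; the function name `success_rate` is only illustrative.

```python
def success_rate(cell_counts, white_value=0):
    """Share of non-white cells among all cells in the marked area."""
    total = sum(cell_counts.values())
    non_white = total - cell_counts.get(white_value, 0)
    return non_white / total

counts = {0: 36, 1: 49, 2: 44}   # white, light green, dark green
rate = success_rate(counts)      # 93 / 129
relative = rate / 0.75           # 0.75 is the maximum stated in the text
print(round(rate, 2), round(relative, 2))  # 0.72 0.96
```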
The probability of a pattern appearing in the space results partly from the pattern's weight and partly from a random factor. This randomness is the only element of the whole algorithm that is not deterministic. In each iteration, once the next cell to solve has been determined, all the patterns that could appear in this cell are passed to the function that decides which pattern is assigned. Modifying the set of legal patterns would violate the constraints, therefore only the probabilities are modified. The probability follows directly from the weights of the patterns. Therefore, when WFC runs with the image, the weights are recalculated based on the image values.
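A minimal Python sketch of this selection step follows. It is an illustration under assumptions, not the project's implementation: the pattern names A, B and C, the base weights and the multipliers for a dark cell are illustrative values, and `choose_pattern` and `image_adjusted_weights` are hypothetical names. The key point it shows is that the set of legal patterns is left untouched and only the weights used for the random draw change.

```python
import random

def choose_pattern(legal_patterns, weights, rng=random):
    """Draw one pattern from the legal ones, with probability proportional to weight."""
    legal = sorted(legal_patterns)
    return rng.choices(legal, weights=[weights[p] for p in legal], k=1)[0]

def image_adjusted_weights(base_weights, multipliers):
    """Rescale the base weights for a cell; the legal set itself is never changed."""
    return {p: w * multipliers.get(p, 1.0) for p, w in base_weights.items()}

# Illustrative base weights and hypothetical multipliers for a cell in the dark area.
base = {"A": 0.52, "B": 0.20, "C": 0.28}
dark_cell = image_adjusted_weights(base, {"A": 0.5, "B": 2.0, "C": 2.0})

rng = random.Random(0)  # seeded only to make the sketch repeatable
picks = [choose_pattern({"A", "B"}, dark_cell, rng) for _ in range(1000)]
```

With these numbers pattern B is drawn more often than pattern A in the dark cell, although A has the larger base weight.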
The original weights are: 0.52 for the white cell, 0.20 for the light green cell, and 0.28 for the dark green cell.
The results are presented for a simply shaped input image, as this makes it possible to verify whether the proposed solution is successful. The methods of result evaluation are adjusted to each technique.
5.1. Methods of evaluation
5.1.1. Analysis of the results in the areas that the user specified as desirable for specific patterns to appear
This method analyses the cells used to build the area marked as black in the input image. The term success rate represents the proportion of non-white values to the total number of cells in that area.
Method I: Generative Adversarial Networks
Rescale the result, run the same operation as on the method II files.
Method II: Modification of probability and entropy
1. Linear mapping
Chart – (data exported)
2. Exponential growth
Chart – (data exported)
3. Enforcing patterns building desirable context
The chart visualising both methods together, and after that summary:
The image-to-image translation usually gives very high rates. Because the neighbour relationships of WFC are not introduced to the algorithm, the input image has a much higher priority than in the second approach.
The results of the second approach change according to the method used to calculate the weights. As predicted, the rate increases when the weights of patterns B and C get higher while pattern A is weakened. The exponential weight calculation reaches up to xx success rate. Method III gave the highest results, with xx success rate. Understanding the structure of the pattern, rather than manipulating single components, gave more promising results.
Evaluate also white spaces
5.1.2. Image distance (vectors, distances etc etc):
Image distance
Colour, brightness ???
Process image (as for pix2pix) and then check image distance
5.1.3. Constraints check – WORKS

This evaluation method checks whether the content generated using image-to-image translation keeps the constraints of the wave function collapse model. The whole dataset was tested and the results can be read in Table 12.
The proportion of patterns from the library to new illegal patterns decreases as the training of the neural network progresses. This is, however, only a tool for partial evaluation of the result. In the first epochs all the patterns exist in the library, but the picture is very limited and is built mainly from empty patterns. In the early epochs 100% of the patterns are correct; the number drops to 66% in the latest epochs.
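The check can be sketched in Python as follows. This is a simplified illustration, not the project's evaluation code: it assumes the generated image has already been decoded to a grid of cell values, uses 2 x 2 patterns without rotations, reflections or wrap-around, and the names `extract_patterns` and `legal_share` are hypothetical.

```python
def extract_patterns(grid, n=2):
    """Return every n x n window of the grid as a hashable tuple of rows."""
    rows, cols = len(grid), len(grid[0])
    return [
        tuple(tuple(grid[r + dr][c:c + n]) for dr in range(n))
        for r in range(rows - n + 1)
        for c in range(cols - n + 1)
    ]

def legal_share(result_grid, library, n=2):
    """Fraction of windows in the result that also exist in the pattern library."""
    windows = extract_patterns(result_grid, n)
    legal = sum(1 for w in windows if w in library)
    return legal / len(windows)

# The library is built from the sample; the generated grid differs in one cell.
sample = [[0, 0, 1],
          [0, 1, 1],
          [1, 1, 2]]
library = set(extract_patterns(sample))

generated = [[0, 0, 1],
             [0, 1, 2],
             [1, 1, 2]]
share = legal_share(generated, library)
```

A single changed cell affects every window that overlaps it, which is why the share of legal patterns can fall quickly once the network starts producing richer images.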
5.1.4. Pattern occurrence in GAN – WORKS
The pattern occurrence measure counts how many times each pattern from the generated library of patterns appears in the GAN result. This gives a better understanding of the table of pattern constraints and explains the phenomenon of a high rate of kept constraints even in the early epochs.
5.2. Comparison of results
Maybe some chart that can fit all the results? Or only summary?
Maybe summary of method I and then summary of method II and compare them?
Gachagan, Anthony, Chigozie Enyinna Nwankpa, Winifred Ijomah, and Stephen Marshall. 2018. “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning.”
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems 27.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. “Image-to-Image Translation with Conditional Adversarial Networks.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Karth, Isaac, and Adam M. Smith. 2018. “Addressing the Fundamental Tension of PCGML with Discriminative Learning.” Procedural Content Generation. San Luis Obispo.
Karth, Isaac, and Adam M. Smith. 2017. “WaveFunctionCollapse is Constraint Solving in the Wild.” Proceedings of the 12th International Conference on the Foundations of Digital Games. Hyannis, Massachusetts, USA: Adventure Works Press.
Kleineberg, Marian. 2018. Github. 15 July. Accessed 2019.
Scurti, Hugo, and Clark Verbrugge. 2018. “Generating Paths with WFC.” Proceedings of the Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Weather Prediction through Sentiment Analysis on Twitter and Multi-Dimensional Data

While it is widely believed in psychology that weather has some influence on a person's mood, the relationship between the two has long been debated. This project studies that long-running debate through sentiment analysis of data from two sources: Twitter, and regular weather forecasts obtained from forecast links. The analysis is performed on Twitter data obtained through the Twitter API, collected with respect to the attributes of the multi-dimensional data, and tries to reveal correlations between tweets and the multi-dimensional weather data. The project also aims to predict the weather based on neural combinational associations.
A person's physical, psychological and economic well-being is supported by their mood and emotional state. Biological factors such as cortisol levels and cardiovascular functioning are related to positive emotions. These factors influence social involvement and support, and may amplify economic success. Social platforms reflect emotional states and individual emotions in an elaborate manner. Limitations of small sample sizes and of generalization have been noted in studies of the relation between mood and weather [40].
The term weather describes day-to-day variations in our atmosphere. This includes temperature, humidity, wind speed, wind direction and atmospheric pressure, among other variables. The climate of a locality is characterized by examining weather statistics to assess the daily, monthly and annual means, medians and variability of the weather data. Climate is, therefore, a long-term average of weather [15].
Weather data is collected and stored in datasets. These datasets contain combinations of humidity, temperature, rainfall, radiation, snow depth, vapor pressure, wind speed, air pressure, sunlight intensity, etc. Improving prediction requires historical datasets, which amount to huge volumes of data collected from different sources (big data); processing this data requires new hardware and software with suitable tools and techniques [1].
The review in [12] covers methods such as the Radial Basis Function Network, BPA (Back Propagation Algorithm), SVM (Support Vector Machine) and SOM (Self-Organizing Map), and states that many researchers used BPA for weather prediction. In [5] the authors reviewed various rain forecasting models based on NNs (Neural Networks), such as FFNN, RNN and TDNN. The survey indicates that these are suitable for weather prediction alongside forecasting techniques such as numerical and statistical models. Neural networks give better results for yearly data, but perform poorly on daily and monthly data. In [14], Shoba G. et al. investigated different methods, such as ANFIS (Adaptive Neuro Fuzzy Inference System) and the SLIQ decision tree, for rainfall forecasting. Balamurugan et al. [15] compared data mining algorithms such as Decision Tree, KNN (K-Nearest Neighbor), Neural Networks and Fuzzy Logic for rainfall prediction, and concluded that neural networks give better results. Anshal Savla et al. [7] discussed classification techniques of data mining such as SVM (Support Vector Machine), RF (Random Forest), NN, Bagging and REP Tree, and concluded that the bagging classification method is the best for rainfall forecasting.
1.1 Objective
– Attribute-based sentiment analysis on Twitter data and multi-dimensional data obtained from weather forecast links.
– Predicting the weather forecast using weather labels, based on the sentiment analysis result.
1.2 Problem Statement
Sentiment analysis and prediction on Twitter data together with multi-dimensional data, with respect to the attributes present in the weather data, is quite complicated, as tweets may not contain any mention of the relevant attributes.
Thus, weather data is needed from official open-access weather forecast sites with gateways. The two data sources (Twitter and multi-dimensional) do not match, so preprocessing is required before the analysis.
During this process, the multi-dimensional data is converted to XLSX, as the data processing algorithm requires raw data in the form expected by the native API. Similarly, the Twitter data must be filtered, as the data needs to be in a numerical rather than textual format, and filtering by attribute is a challenge in itself.
Related Work
2.1 Methodologies used for weather forecast based on different types of time period
Researchers have proposed different methods and models for the early prediction of weather conditions. Weather forecasting approaches fall into four scales based on the period of time:
– long scale is yearly,
– medium scale is monthly,
– short scale is weekly and
– very short is daily.
Models have been developed in papers [8] and [18] for forecasting long-term data. The work in papers [4], [10] and [28] has developed models for forecasting medium-term data, while the work in papers [3], [16] and [27] discusses models developed for forecasting short-term data. For forecasting daily data (very short-term data), models have been developed in papers [1], [2], [5], [7], [19], [20] and [21].
Map Reduction Algorithm:
Bendre and Thool [1] proposed a MapReduce algorithm to predict weather conditions on a daily basis, using ICT services in an agricultural big data environment to collect a huge amount of data. They generated data from the KVR (Krishi Vidyapeeth Rahuri) weather station and analyzed it on a daily basis. They conclude that this approach can increase the accuracy of the weather forecasting system by using various weather parameters for future precision farming.
SVM and FCM:
Sanjeev Kumar Singh et al. [2] proposed the early recognition of tropical cyclones (TCs) using global model products. They examined 14 TCs that developed in the NIO (North Indian Ocean) between 2008 and 2011, and obtained forecast fields at 6-h intervals up to 120 h ahead of the formation of a cyclone over the NIO domain. As a continuation, they plan to apply the above methods to non-developing systems as well, for complete validation and further enhancement. Kulwarun Warunsin and Orachat Chitsobhuk [10] presented an early cyclone detection system based on wind speed and wind direction. They proposed SVM classification and FCM clustering techniques to identify cyclones early, and concluded that FCM offers the highest accuracy, 93%, while SVM produces poor results due to outliers.
Predictions based on Atmospheric Computer Models:
Takemasa, Keiichi and Koji [3] proposed numerical weather prediction based on computer models of the atmosphere. In this method, synchronizing the computer simulation with the real world is essential to accurately determine the atmosphere's current state in a six-hour cycle. This method is not suitable for larger numbers of samples and observations or for higher resolutions.
Huang Qing et al. [8] proposed China-CGMS. They analyzed daily data in China using regression and scenario analysis, and made early crop predictions for administrative sectors. They concluded that constructing calendars for more yield types, and using input parameters from detailed soil analysis and weather datasets, would increase the adaptation of China-CGMS in the near future.
Wavelet ANN and Wavelet postfix- GP model:
V. Dabhi et al. [29] proposed a weather prediction system for daily data using the Wavelet ANN and Wavelet postfix-GP models [12][13]. On a daily basis, the predicted values are not accurate.
2.2 Issues involved in weather forecasting
Due to the large volume of weather datasets, conventional models do not give accurate results. To increase the accuracy of the system, a direct storm formation prediction system is required. NWP techniques cannot solve the prediction of local weather conditions because they are unstable [25]. Statistical models also cannot produce good results because they are built on assumptions [6]. There are four types of weather forecast methods [5]:
– Very short-scale forecast: hours basis (1-5)
– Short-scale forecast: 6 hours – few days (week basis)
– Medium-scale forecast: months basis (1-10).
– Long-scale forecast: Yearly basis
The challenges of long-, medium-, short- and very-short-term data are as follows. For long-term weather datasets, there is no simple process for determining the weather input parameters, and too many or too few parameters can affect the results because of the long time period (years). It is difficult to use the same prediction model for short periods, because input parameters change on a daily or weekly basis; a changed or newly added parameter therefore does not fit into a model that has already been developed. Long-term forecasting depends on the sampling period of the input data, and a larger training dataset gives better results. Very-short-term and short-term weather datasets may contain distortion and noise associated with random variations of the input parameters, so daily or weekly data may not provide accurate results. In comparison, monthly data provide better results than weekly data, and yearly data provide far better results than monthly and weekly data [16][27].

Fig 1: Types of Weather Forecast Methods
2.3 Sentiment Analysis
In recent years, sentiment analysis has been among the most widely used evaluation techniques. Its appeal comes from its use of NLP, biometrics and text analysis to evaluate the emotional state of an individual. Sentiment analysis helps in acquiring various kinds of information, including emotions and opinions [33].
As social media platforms have become dominant in recent years, an automated system for analysis is required. With social media platforms as data sources, the user gets access to large amounts of data to analyze and base decisions on [33]. However, preprocessing this data manually is burdensome. Hence sentiment analysis plays a vital role in providing the user with automated systems to analyze the available data sources.
The results of sentiment analysis are based on attributes. The tools used for this process help to match the attributes with expressions of human emotion. The process also involves collecting information based on the desired keywords and emotions from data sources (Twitter, in this project) [34]. In addition to extracting data, these analysis tools also assist in prediction and can be applied in different fields. Several investigations conclude that sentiment analysis is widely used, is the subject of many research papers, and is gaining ground on the web.
Sentiment analysis encompasses several techniques that can be used in business and in general analysis. Several approaches and applications, comprising Scaling Systems, the Bales Interaction Process and Subjectivity/Objectivity Identification, are discussed in [37]. The author classifies these analyses as Machine Learning, NLP, Text Mining and Hybrid approaches.
Approaches of Sentiment Analysis:
Generally, the Lexicon and Machine Learning approaches are the two prominent methods of sentiment analysis. These methods are popular because of the type of result they produce, irrespective of the field in which they are used.
– Lexicon Approach:
Over the years, the Lexicon approach has been used in various studies to perform sentiment analysis. It is described as a list of words, each with a score that marks it as positive, negative or objective in nature. According to [38], the Lexicon approach uses textual opinions to calculate the sentiment polarity. The Novel Machine Learning approach, the Ensemble approach and Corpus learning are considered widely used Lexicon approaches for sentiment analysis.
Lexicon-based approaches have their own advantages and disadvantages. The major advantage is that data can be used without any preparation. The approach works by extracting the positive and negative words in a sentence, so the extraction or collection of data is accomplished with the pre-defined list of words.
Though they are considered easy to implement, these approaches have several disadvantages. Lexicon-based approaches find it hard to understand the slang used on social media sites [39]. Another disadvantage is that a predefined list of words (also called a lexicon-based dictionary) must be created wherever the approach is applied. This is why they are not well suited to modern language sets.
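A lexicon-based classifier of the kind described above can be sketched in a few lines of Python. The word list and its scores are a hypothetical mini-lexicon invented for illustration, and `polarity` is an illustrative name; real lexicons are far larger.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score.
LEXICON = {"good": 1, "lovely": 1, "sunny": 1,
           "horrible": -1, "bad": -1, "gloomy": -1}

def polarity(text, lexicon=LEXICON):
    """Sum the scores of known words; unknown words (e.g. slang) contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "objective"

examples = [
    "Horrible weather with 33 degrees temperature",
    "Lovely sunny day",
    "wx is lit today",   # slang that the lexicon does not know
]
labels = [polarity(text) for text in examples]
```

The third example illustrates the weakness noted above: slang that is missing from the dictionary leaves the score at zero, so the tweet is labelled objective even though it expresses a sentiment.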
– Machine Learning Approach:
This approach aims to extract the sentiment polarity based on datasets. The machine learning approach stays ahead because it can adapt with the aid of linguistic features and ML algorithms. ML algorithms work with both supervised and unsupervised methods. Some of the most widely used methods based on this approach for sentiment analysis are Support Vector Machine, K-Nearest Neighbor, Naïve Bayes and Neural Network.
Comparing the advantages of the two approaches, the machine learning approach edges out the lexicon-based approach because it can adapt to the context of study. Due to this adaptability, the machine learning approach does not require a specific set of keywords or a dictionary [38]. The ability to handle multiple languages and provide high accuracy makes the machine learning approach advantageous.
The downside of the machine learning approach is the need for a set of labelled data for each new kind of data, which reduces its applicability in this context. Models trained on text from a specific field will not be attuned to another field. Nonetheless, ML-based approaches tend to classify well and provide better sentiment analysis than other approaches [38].
2.4 Sentiment Analysis on Weather
Weather forecasts are made by collecting vast amounts of data on attributes that include temperature, wind speed, wind direction, air pressure and humidity. This volume of data can be chaotic and can lead to less accurate predictions. Hence this project performs data acclimatization and uses it together with threshold and keyword-based filtration to predict the weather forecast [35].
3.1 High Level Architecture
The following figure shows the high-level architecture of the project:

Fig 2: High-Level Architecture
The architecture shows the two kinds of data involved in the analysis: Twitter data and multi-dimensional data. Both datasets pass through several stages before a prediction is reached. Initially, they are preprocessed and then segregated by month (May, June and July). At this stage there are six separate data files: three for Twitter and three for the multi-dimensional data, one each for May, June and July. A sentiment analysis is performed based on the attributes, i.e. the weather variables, and the most dominant attribute (the one with the highest frequency) is determined. Thus, for each month, a pair of attributes is obtained. This pair is compared against the standard weather labels, using all neural combinations of attributes from the Twitter and multi-dimensional data. Based on the combinations, the result is a quantitative prediction based on neural combinational associations.
3.2 Data Collection:
This project involves two kinds of data:
– Weather related data collected from Social Media Website such as Twitter
– Weather data collected from weather forecast websites
Both kinds of data are collected for the location Edmonton, Canada.
3.2.1 Multi-Dimensional Data Collection:
The month-wise data is collected from web weather gateways, i.e. open-access weather crawling sites. Data is gathered from these sites using input parameters and, depending on site access and the available attributes, is written out month by month in CSV format. This multi-dimensional collection is not a single request but a multiple-access mechanism, as the data is organised by month with an attribute-based filtration model. The gateway code is written in Python, and the filtered data is stored month-wise as CSV. Some missing data was filled in manually from the official Canadian government weather website [36]. The CSV is then converted to XLSX format, because the data is loaded into Java, where the application works on XLSX through the POI (Poor Obfuscation Implementation) API.

Fig 3: Python script for data collection from open weather API
3.2.2 Twitter Data Collection:
The social media data consists of 3.5 billion posts in total, with 2.4 billion from Twitter. Twitter data is likely to contain text expressions revealing the user's underlying emotional state; it also allows additional investigation into the mechanisms underlying changes in expressed sentiment, and comparison of effect sizes with other events.
Twitter gives developers access to a range of streaming APIs that offer low-latency access to flows of Twitter data. For the data collection, the public streams API was used; it was found to be the most suitable method of gathering information for data mining purposes, as it allows access to a global stream of Twitter data that can be filtered as required. To take advantage of this stream, a Java interface library had to be installed, which was necessary for Java to interface with Twitter's API v1.1. A number of libraries were available for this task. Java twitter tools v1.14.3 was chosen, as it provided the basic filtering and streaming functionality required for this project. Twitter imposes numerous regulations and rate limits on the API. For this reason, all users must register an account and provide authentication details when they query the API. Registration requires an email address and telephone number for verification; once the account is verified, the user is issued the authentication details that allow access to the API. A Java program was then created that provided the API with the authentication details and initialized a streaming process through which data could be pulled from Twitter's RESTful web service to a local machine. A filter function allowed the program to request Twitter content based on keywords related to this study. All the downloaded data was transmitted in JSON format, which was found to be less verbose than the alternative format on offer, XML.
Each JSON-formatted package contained a large amount of information, but for this project only the tweet text and the time the tweet was written were required. To remove the unwanted content, each package was parsed by a Java program that located the useful content and stored it in RAM until it could be written to permanent storage. An additional check ensured that all downloaded tweets were written in English. This check involved parsing the JSON content for a ‘Lang’ tag and performing an equality check on its content.
Once the required content had been extracted from the JSON package and stored in RAM, it could be written to permanent storage. There were several options for storing the information, such as a comma-separated values (CSV) file, a text file or a dataset. It was decided that the optimum approach was to use a text file. A dataset was created with a simple table structure that included a priority attribute field. The priority attribute was generated automatically by incrementing a counter each time the dataset was written to.
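The parsing and language check can be sketched in Python (the project itself used Java for this step, so this is only an illustration). The sketch assumes the package follows the Twitter API v1.1 tweet object, with the fields `text`, `created_at` and `lang`; the function name `extract_english_tweet` and the sample package are hypothetical.

```python
import json

def extract_english_tweet(raw_json):
    """Return (text, created_at) for an English tweet, or None otherwise."""
    package = json.loads(raw_json)
    if package.get("lang") != "en":   # equality check on the language tag
        return None
    return package.get("text"), package.get("created_at")

# Hypothetical sample package; real packages carry many more fields.
raw = json.dumps({
    "created_at": "Mon Jul 01 12:00:00 +0000 2019",
    "lang": "en",
    "text": "Horrible weather with 33 degrees temperature :)",
    "user": {"screen_name": "example_user"},
})
kept = extract_english_tweet(raw)
dropped = extract_english_tweet(json.dumps({"lang": "fr", "text": "Il pleut"}))
```

Everything except the tweet text and its timestamp is discarded, which keeps the amount of data held in memory small before it is written out.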
3.3 Data Preprocessing:
Data preprocessing is done to eliminate the incomplete, noisy and inconsistent data. Data must be preprocessed in order to perform any data mining functionality.
3.3.1 Multi-Dimensional Data Preprocessing:
The open weather API provided data for only some days of each month, and an API exception error caused some data to be missing. The missing data was filled in manually from the official Canadian government weather website.
The final obtained dataset consists of 5 weather related attributes:
– Temperature
– Humidity
– Wind direction
– Wind speed
– Air pressure
3.3.2 Twitter Data Preprocessing:
Twitter is a real-time information network that connects an individual to the latest climatic conditions and to news about what they find interesting. This is done by searching for the accounts one finds most compelling and following their conversations and tweets. At the heart of Twitter are small postings of information called Tweets. Each tweet is at most 140 characters long. Emojis, photos, videos and conversations are directly visible in tweets, which provide the whole story at a glance, all in one place.
Using Java IO streams, all tweets are loaded into memory. The collected tweets contain a mix of sentiment-bearing text, attribute data and numerical values. The data is preprocessed and cleaned using the following methods.

Fig 4: Preprocessing of Twitter Data
– Tweets are collected and filtered according to weather, with respect to keywords.
– Using the threshold and the words before and after each keyword, the data is filtered; the remaining data is treated as noisy data.
– The above process is carried out by tokenizing each tweet into fully qualified words as part of the cleaning process.
– Removing URLs: some tweets contain a URL as a single token starting with http://, https:// or www. These are removed using regular expressions.
– Question words such as what, which and how do not contribute to polarity. To reduce complexity, such words are removed.
– Special characters such as . , [ ] { } ( ) / ’ should be removed to avoid discrepancies when polarity is assigned. For example, in “it’s good:”, special characters that are not removed may concatenate with the words and make those words unavailable in the dictionary. Using numeric data discovery, numeric values are enclosed in brackets, for example [90.89], for further calculation.
– Retweeting is the process of copying another user’s tweet and posting it to another account, which usually happens when a user likes another user’s tweet. Retweets are commonly abbreviated as “RT”. As an example of attribute extraction, consider the tweet “Horrible weather with 33 degrees temperature :)”. This tweet is recorded as temperature with [33], where [33] is the numerical value associated with the attribute temperature.
– After loading the data, the tweets are extracted along with their numerical data for all five attributes (temperature, humidity, wind speed, wind direction and air pressure), as in the multi-dimensional data.
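The cleaning steps listed above can be sketched in Python with regular expressions (the project performs them in Java). This is a simplified illustration under assumptions: the list of question words, the rule of pairing an attribute with the first number in the tweet, and the names `clean_tweet` and `attribute_values` are all hypothetical, and the threshold-based filtering is left out.

```python
import re

QUESTION_WORDS = {"what", "which", "how", "why", "when", "where", "who"}
ATTRIBUTES = ["temperature", "humidity", "wind speed", "wind direction", "air pressure"]

def clean_tweet(tweet):
    """Lower-case a tweet and strip URLs, the RT marker, question words and symbols."""
    text = re.sub(r"(https?://|www\.)\S+", " ", tweet.lower())
    text = re.sub(r"\brt\b", " ", text)
    text = re.sub(r"[^a-z0-9.\s]", " ", text)   # keep '.' so decimals survive
    tokens = [t.strip(".") for t in text.split()]
    return [t for t in tokens if t and t not in QUESTION_WORDS]

def attribute_values(tweet):
    """Pair each attribute named in the tweet with the first number found."""
    cleaned = " ".join(clean_tweet(tweet))
    numbers = re.findall(r"\d+(?:\.\d+)?", cleaned)
    if not numbers:
        return {}
    return {a: float(numbers[0]) for a in ATTRIBUTES if a in cleaned}

example = attribute_values(
    "RT Horrible weather with 33 degrees temperature :) http://t.co/abc")
```

For the example tweet from the list, the sketch yields the attribute temperature with the value 33, matching the [33] notation used in the text.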
3.4 Data Loading:
3.4.1 Multi-dimensional data loading:
Apache provides the POI (Poor Obfuscation Implementation) API. Apache POI, a project run by the Apache Software Foundation and previously a sub-project of the Jakarta Project, provides pure Java libraries for reading and writing files in Microsoft Office formats such as Word, PowerPoint and Excel.
The POI API is used to load the Excel sheet into Java memory: an XSSFWorkbook is created, in which the loaded Microsoft Excel sheet is recreated.
3.4.2 Twitter data loading:
The Twitter data is loaded into Java using a FileInputStream.
3.5 Implementation
This project is implemented in Java with the Swing API. The figure below shows the exact flow of the work. Initially, data is collected from two sources: Twitter and the gateway source (websites with gateway access through a language API). The collected data is cleaned by removing unwanted data: it must be relevant to the weather model and contain numeric data, and it must relate to the weather through lists specific to attributes such as:
– Temperature
– Humidity
– Wind speed
– Wind direction
– Air pressure.

Fig 5: Implementation
The threshold is created using the back propagation technique and the mean weighted average vector, as the threshold depends on the locality. Once the output attributes for each month are obtained, the attributes affecting the weather in that month are assigned relevant labels, so that if the weather is later needed from the same information, the label serves as the prediction.
3.5.1 Flow Chart

Fig 6: Flow Chart
3.5.2 Packages
Many Java class files and their related metadata and resources are bundled into one file for distribution, in a package file format called Java Archive (JAR). The following figure shows the JAR files used in this project.

Fig 7: JAR files
3.5.3 Attribute definitions
allTweets() : An array list used to accommodate all tweets in one single variable
allAttributes() : An array list of all attribute names as labels with comparable keywords.
at1(), at2(), … at10() : Array list of 10 variables to store individual attribute values from the multi-dimension data (excel sheet).
availableAttributes() : Array list containing the attribute names with respect to multi-dimensional data.
allAttsVals() : Array list to store the individual attribute’s threshold value calculation result.
3.5.4 Calculation for Mean Weighted Vector
The following figure shows the code used to calculate the mean weighted average vector of all the attributes per month to determine the maximum valued attribute for the month.

Fig 8: MWV calculation
The maximum attributed value, together with its attribute, is determined by:

Fig 9: Maximum attributed value calculation
The accuracy is calculated by:

Fig 10: Accuracy calculation for July
The accuracy is calculated by dividing the maximum attributed value by the total number of tweets posted on that attribute.
For example, if humidity is the maximum attributed value for that month, the accuracy would be calculated as 78/123, where:
Maximum MWA of the humidity: 78
Total number of tweets from Twitter with numerical data: 123
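The worked example can be checked with a few lines of Java. The class and method names are hypothetical, and the result is expressed as a percentage.

```java
final class AccuracyExample {

    // Accuracy in percent: maximum attributed value / total tweets * 100.
    static double accuracyPercent(double maxAttributedValue, int totalTweets) {
        if (totalTweets <= 0) {
            throw new IllegalArgumentException("totalTweets must be positive");
        }
        return 100.0 * maxAttributedValue / totalTweets;
    }

    public static void main(String[] args) {
        // Humidity example from the text: 78 / 123.
        System.out.printf("%.2f%%%n", accuracyPercent(78, 123)); // prints 63.41%
    }
}
```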
3.5.5 Neural Associations
An attribute-based association model is applied to all the tweets to fetch each relevant attribute and extract its numeric data. This is done in recursive loops, as the association is computed per attribute. For each neural recursion, attribute association data is populated according to the filtration model. The filtration model provides the maximum value of each individual attribute per month, since the neural association model above is based on month-wise individual calculation. Based on this output, the maximum numeric values are populated.
3.5.6 Functionality
This function instantiates all the widget components placed on the Swing UI, including the JFrame itself.
This function is triggered when a clickable button event occurs. Two events are handled: a Twitter event and a collected (multi-dimensional) event.
If the category selected is Twitter: For each month, the input text file is stored as a string and the Java StringTokenizer() class is called. A loop iterates over the length of the text file. Checks are performed to determine whether each tweet contains any of the five attributes mentioned (temperature, humidity, wind direction, wind speed and air pressure). If a tweet contains an attribute, the getValueFromTweet() class is called and the size is incremented. The consolidated value for each attribute is calculated. The maximum consolidated attribute value is determined, and that attribute is reported as the one that most affected that month's weather. The sentiment frequency is displayed as a bar graph. The accuracy is calculated by dividing the maximum attributed value by the total number of tweets posted on that attribute.
If the category selected is collected: For each month, the POI API is used to load the Excel sheet (input file) into Java memory, where an XSSFWorkbook is created and the loaded Microsoft Excel sheet is recreated. A rowIterator() and a cellIterator() are used to iterate through each row and cell of the recreated workbook. A loop iterates over the length of the workbook. The mean weight vector for each attribute is obtained. The maximum consolidated attribute mean weight is determined, and that attribute is reported as the one that most affected that month's weather. The sentiment frequency is displayed as a bar graph. The accuracy is calculated by dividing the maximum mean weighted attribute by the size.
The maximum attributes form a neural association. The prediction() class is called to predict the result of this neural association.
A public class that takes the tweet and the attribute as inputs. It returns the attribute count and the numerical value associated with that attribute.
This function takes the neural association as input, i.e. the attributes with the highest sentiment frequency, and returns the predictions based on the neural combinations of this association, using weather labels.
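The Twitter branch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the class name TweetScan, the helper firstNumber() (a stand-in for getValueFromTweet(), whose actual signature is not shown), the one-tweet-per-line input format, and the rule of taking the first number in a tweet are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TweetScan {

    static final String[] KEYWORDS = {
        "temperature", "humidity", "wind speed", "wind direction", "air pressure"
    };

    private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");

    // Hypothetical stand-in for getValueFromTweet(): the first number in the
    // tweet, or NaN when the tweet carries no numeric data.
    static double firstNumber(String tweet) {
        Matcher matcher = NUMBER.matcher(tweet);
        return matcher.find() ? Double.parseDouble(matcher.group()) : Double.NaN;
    }

    // Scans one month of tweets (assumed one per line) and returns, for each
    // attribute, a two-element array: {matching tweet count, sum of values}.
    static Map<String, double[]> scan(String monthText) {
        Map<String, double[]> totals = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            totals.put(keyword, new double[2]);
        }
        StringTokenizer lines = new StringTokenizer(monthText, "\n");
        while (lines.hasMoreTokens()) {
            String tweet = lines.nextToken().toLowerCase(Locale.ROOT);
            for (String keyword : KEYWORDS) {
                if (!tweet.contains(keyword)) {
                    continue;
                }
                double value = firstNumber(tweet);
                if (Double.isNaN(value)) {
                    continue; // keyword present but no numeric data
                }
                double[] total = totals.get(keyword);
                total[0] += 1;     // size is incremented
                total[1] += value; // consolidated value accumulates
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String may = "Humidity at 80% today\nhumidity 70 and rising\nTemperature is 31.5 C";
        Map<String, double[]> totals = scan(may);
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            System.out.println(e.getKey() + ": count=" + (int) e.getValue()[0] + ", sum=" + e.getValue()[1]);
        }
    }
}
```

A tweet that mentions two attributes is credited to both with the same number, which is a limitation of this simple rule rather than a statement about the original implementation.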
4.1 Approaches
4.1.1 KNN (K-Nearest Neighbors)
There are several considerations important for the interpretation of the results.
While the data captures individuals' expressed sentiment month-wise, as reflected by their tweets, optimal data would also include these individuals' daily self-reported emotional states. While sentiment expressions on social media can be reflective of underlying emotions [30], the linguistic measures employed here represent an imperfect and noisy proxy of emotional factors. Further studies are needed to improve the accuracy and validity of sentiment metrics based on attributes.
KNN takes its inputs after classification, one tuple entry at a time, so single-entry tweets appear all the time. The prediction accuracy was found to be very low and sometimes not comparable with the other approaches trialled. KNN worked reasonably well on the multi-dimensional data using Euclidean distances, but the accuracy remained weak because the generated k value is random and fluctuating.
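For reference, a minimal KNN classifier with Euclidean distance is sketched below. It uses a fixed k rather than the randomly generated k mentioned above, and the class name, feature layout (temperature, humidity) and labels are hypothetical; it is not the implementation that was trialled.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class KnnSketch {

    // Euclidean distance between two feature vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Majority vote among the k training points nearest to the query.
    // On a tied vote, the label that reached the top count first wins.
    static String classify(double[][] points, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> euclidean(points[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int j = 0; j < Math.min(k, order.length); j++) {
            String label = labels[order[j]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative rows only: {temperature, humidity}.
        double[][] points = {{30, 80}, {31, 85}, {29, 78}, {15, 30}, {14, 35}};
        String[] labels = {"humid", "humid", "humid", "dry", "dry"};
        System.out.println(classify(points, labels, new double[] {28, 75}, 3));
    }
}
```

Fixing k makes the result reproducible for a given dataset, which is one way to avoid the fluctuation attributed to a random k.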
4.1.2 LIWC (Linguistic Inquiry Word Count)
The chosen LIWC sentiment metrics may imperfectly measure the sentiment of expressions on social media. The robustness of the findings to the use of other sentiment classification tools with the Twitter data is examined in SI: Alternative measures of expressed sentiment.
In these analyses both the keyword strength and tweet priority algorithms were employed, and the results were found to be quite robust across all three of the employed sentiment metrics. However, because all three metrics likely have idiosyncratic errors associated with them, the measurement of the sentiment of expressions remains imperfect. To determine whether a social media post uses words that express positive or negative sentiment, the Linguistic Inquiry Word Count (LIWC) sentiment analysis tool is used [40].
LIWC is a highly validated, dictionary-based, sentiment classification tool that is commonly used to assess sentiment in social media posts [5],[6],[23],[24] (Note: the results obtained are similar under the use of alternative sentiment classifiers, SI: Alternative measures of expressed sentiment). In this analysis, positive and negative sentiment are treated as separate constructs [31].
4.1.3 Threshold and key-word based filtration
This was the final approach, and the one that gave a valid outcome. It depends entirely on the keywords of the attributes from the multi-dimensional data, as observed in the cleaned Twitter data.
The keywords are:
– Temperature
– Humidity
– Wind Speed
– Wind Direction
– Air Pressure
The tweets are populated based on these attributes, because the data gathered from the open-source weather forecast gateway links (the multi-dimensional data) is organised around the same attributes.
Thus, by extracting the numeric data from all the tweets and comparing the result (per month and per attribute) with the attribute's threshold, the maximum (highest frequency) among all attributes is found. This attribute is regarded as the one that most affected the corresponding month's weather with respect to the Twitter data.
For the multi-dimensional data, the mean of each of the above attributes is obtained and the MWA (mean weighted average) is calculated; the attribute means are then checked against the individual thresholds to find the maximum (highest frequency) among them. This attribute is regarded as the one that most affected the corresponding month's weather.
Thus, for each month, a pair of attributes is obtained that most affected that month's weather. This pair is compared against the standard weather labels for all neural combinations of attributes from the Twitter and multi-dimensional data. The result is a predictive output based on these neural combinational associations.
The combinations are as below:
Source Attribute
Target Attribute
Combinational Result
Hot and Humid
Wind Direction
Wind Speed
Warm and Windy
Air Pressure
Warm and Clear Skies
Hot and Humid
Wind Direction
Rain and Moving Clouds
Wind Speed
Rain and Storm
Air Pressure
Rain and Clear Skies
Wind Direction
Wind Direction
Moving Clouds and Rain
Wind Direction
Wind Direction
Wind Direction
Wind Speed
Cool and Windy
Wind Direction
Air Pressure
Windy and Clear Skies
Wind Speed
Warm and Windy
Wind Speed
Moving clouds and Rainy
Wind Speed
Wind Direction
Wind Speed
Wind Speed
Wind Speed
Air Pressure
Windy and Clear Skies
Air Pressure
Sunny and Clear Skies
Air Pressure
Rainy and Clear Skies
Air Pressure
Wind Direction
Windy and Cool with Clear Skies
Air Pressure
Wind Speed
Windy and Clear Skies
Air Pressure
Air Pressure
Cool and Clear Skies
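The combination step amounts to a lookup from an attribute pair to a weather label. The sketch below shows one way to express it, using only a few pairs whose three cells are unambiguous in the listing above. The class name, the key format and the "Unknown" fallback are hypothetical, and which dataset supplies the source and which the target attribute is an assumption left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

final class CombinationLookup {

    private static final Map<String, String> LABELS = new HashMap<>();

    static {
        // A subset of the combinations, copied from the table above.
        LABELS.put(key("Wind Direction", "Air Pressure"), "Windy and Clear Skies");
        LABELS.put(key("Air Pressure", "Wind Direction"), "Windy and Cool with Clear Skies");
        LABELS.put(key("Air Pressure", "Wind Speed"), "Windy and Clear Skies");
        LABELS.put(key("Air Pressure", "Air Pressure"), "Cool and Clear Skies");
    }

    // Builds an order-sensitive key from the source and target attributes.
    static String key(String source, String target) {
        return source + "|" + target;
    }

    // Returns the combinational result, or "Unknown" for a pair not listed here.
    static String predict(String source, String target) {
        return LABELS.getOrDefault(key(source, target), "Unknown");
    }

    public static void main(String[] args) {
        System.out.println(predict("Air Pressure", "Wind Speed")); // prints Windy and Clear Skies
    }
}
```

The key is order-sensitive because the table lists different results for the same two attributes in opposite roles, for example Air Pressure with Wind Direction.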

Fig 11: Sentiment Analysis Dashboard
The above figure shows the main sentiment dashboard, designed using Java Swing. It offers two options for loading data into memory: from Twitter and from the multi-dimensional datasets. The Java file stream APIs, together with array lists, are used to accumulate the data in memory, and all data is tokenized and compared against all attributes.
The total and individual consolidated mean weighted averages of the attributes are then calculated, and the highest-valued attribute is populated based on the threshold value per attribute. This is done for three individual months, as the data is segregated month-wise (May, June and July).
5.1 Selection of Data: Twitter
The following figures show the results for May, June and July.

Fig 12: Consolidated Weather report for May (Twitter)
The above figure displays the results for the month of May. The evaluation is based on May's consolidated and aggregated mean weighted average of each attribute, taken against the individual thresholds. From the collection classes, the maximum-valued attribute is obtained as the sentimentally dominant attribute for that month's data. Thus, May is dominated by high humidity.

Fig 13: Consolidated Weather report for June (Twitter)
The same process is applied for June. It is observed that June is dominated by high temperature.

Fig 14: Consolidated Weather report for July (Twitter)
Similarly, the same process is applied for July. It is observed that July is dominated by high humidity.
5.2 Selection of Data: Multi-Dimensional

Fig 15: Consolidated Weather report for May (Multi-Dimensional)
The above figure shows the result for May. The results show that May has high air pressure and humidity, based on the evaluation that the mean weighted average of air pressure is high in the attribute calculation. This calculation uses the values populated through POI. The data is held in array lists in Java memory, and these individual array lists are passed to the collection classes to obtain the maximum-valued attribute. Thus, air pressure is high in May under this process.

Fig 16: Consolidated Weather report for June (Multi-Dimensional)
The above figure shows the results for June. The same process is applied as above. It is observed that June has high air pressure and humidity with this process.
The figure below shows the results for July. The results show that July has high air pressure and humidity, based on the evaluation that the mean weighted average of air pressure and humidity is high in the attribute calculation. This calculation uses the values populated through POI. The data is held in array lists in Java memory, and these individual array lists are passed to the collection classes to obtain the maximum-valued attribute. Thus, air pressure and humidity are high in July under this process.

Fig 17: Consolidated Weather report for July (Multi-Dimensional)
5.3 Sentimental Analysis Results

Fig 18: Maximum values per month
The above figure shows the maximum values per month with respect to attribute. Across May, June and July, the Twitter data gives fluctuating, dissimilar results, whereas the multi-dimensional data gives similar results for all three months.

Fig 19: Accuracy

Fig 20: Final Accuracies
The above two figures show the accuracies. The formulas and calculations are shown below:
For Twitter:
The total number of tweets for the maximum attribute is divided by the total number of tweets.
The result is multiplied by 100.
For multi-dimensional:
The total mean weighted average for the maximum attribute (i.e. air pressure, if air pressure is the maximum) is divided by the total gathered count for that attribute (air pressure).
The result is multiplied by 100.
The following figure shows the final representation of all the combinations, month-wise and attribute-wise (for the maximum-valued attribute).

Fig 21: Pie Chart of Accuracies
5.4 Prediction

Fig 22: Prediction results
The above figure shows the prediction results as combinational neural model attributes, with all neural associations from the Twitter prediction and the multi-dimensional prediction.
This review surveys prediction methods used by different researchers for weather forecasting, applied to social media data from Twitter and to data gathered from accessible government weather forecast web gateways. It also addresses the limitations and issues that need attention when applying different weather forecasting methods. The review shows that threshold and keyword-based filtration with the mean weighted average vector performs better than the other prediction techniques considered, KNN and LIWC. KNN performs well on a large-scale (monthly) basis, whereas LIWC is less accurate on a medium-scale and daily basis. Threshold and keyword-based filtration produces good results on a monthly (large-scale) basis and is the better technique for predicting the weather sentimentally among all the approaches above. Threshold and keyword-based filtration with the mean weighted average offers the highest accuracy, 87.24%, although outliers reduce detection performance.
[1] M. R. Bendre and R. C. Thool, “Big Data in Precision Agriculture: Weather Forecasting for Future Farming,” in 1st International Conference on Next Generation Computing Technologies (NGCT-2015), Dehradun, 978-1-4673-6809-4/15, IEEE, 2015, pp. 744–750.
[2] Sanjeev Kumar Singh, Neeru Jaiswal et al., “Early Detection of Cyclogenesis Signature Using Global Model Product,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, August 2014.
[3] Takemasa, Kondo and Koji, “Big Ensemble Data Assimilation in Numerical Weather Prediction,” Grand Challenges in Scientific Computing, 0018-9162/15, IEEE, Nov. 2015.
[4] Walter Akio Goya et al., “The Use of Distributed Processing and Cloud Computing in Agricultural Decision-Making Support Systems,” 2014 IEEE International Conference on Cloud Computing.
[5] Mohini P. Darji, Vipul K. Dabhi and Harshadkumar B. Prajapati, “Rainfall Forecasting Using Neural Network: A Survey,” ICACEA-2015, Ghaziabad, India.
[6] Geetha and Selvaraj, “Prediction of monthly rainfall in Chennai using Back Propagation Neural Network model,” Int. J. of Eng. Sci. and Technology, vol. 3, no. 1, pp. 211–213, 2011.
[7] Anshal Savla et al., “Survey of classification algorithms for formulating yield prediction accuracy in precision agriculture,” IEEE Sponsored Second International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 2015.
[8] Huang Qing et al., “China Crop Growth Monitoring System-Methodology and Operational Activities Overview,” in IEEE 2014.
[9] Keng-Pei Lin and Ming-Syan Chen, “On the Design and Analysis of the Privacy-Preserving SVM Classifier,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 11, November 2011.
[10] Warunsin and Chitsobhuk, “Cyclone identification using Fuzzy C Mean clustering,” 13th International Symposium on Communications and Information Technologies (ISCIT), 978-1-4673-5580-3, IEEE, 2013.
[11] Kotal, Kundu, et al. Bhowmik, “Analysis of cyclogenesis parameter for developing and non-developing low pressure systems over the Indian Sea,” Nat. Hazards, vol. 50, no. 2, pp. 389–402, Aug. 2009.
[12] S. K. Nanda et al., “Prediction of rainfall in India using Artificial Neural Network (ANN) models,” Int. J. of Intell. Syst. and Applicat., vol. 5, no. 12, pp. 1–22, 2013.
[13] D. R. Nayak, A. Mahapatra, and P. Mishra, “A Survey on rainfall prediction using Artificial Neural Network,” Int. J. of Comput. Applicat., vol. 72, no. 16, pp. 32–40, 2013.
[14] Shoba et al., “Rainfall prediction using Data Mining techniques: A Survey,” Int. J. of Eng. and Comput. Sci., vol. 3, no. 5, pp. 6206–6211, 2014.
[15] Sangari and Balamurugan, “A Survey on rainfall prediction using Data Mining,” Int. J. of Comput. Sci. and Mobile Applicat., vol. 2, no. 2, pp. 84–88, 2014.
[16] Abbot and Marohasy, “Application of Artificial Neural Networks to rainfall forecasting in Queensland, Australia,” Advances in Atmospheric Sci., vol. 29, no. 4, pp. 717–730, 2012.
[17] Dennis A. Ludena R. et al., “2013 Second IIAI International Conference on Advanced Applied Informatics”.
[18] Gabriel et al. “Big Data environment for agricultural soil analysis from CT digital 2016 IEEE
[19] Wu Fan et al., “Prediction of crop yield using big data,” 8th International Symposium on Computational Intelligence and Design, IEEE, 2015.
[20] Masayuki Hirafuji, “A Strategy to Create Agricultural Big Data,” 2014 SRII Global Conference, 978
[21] Dennis A. Ludena R. et al., “A Big Data approach for a new ICT Agriculture Application Development,” 2013 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 978-0-7695-5106-7/13, IEEE, 2013.
[22] Frederica Darema et al., “InfoSymbioticSystems - Large-Scale Dynamic Data and Large-Scale Big Computing for Smart Systems,” 2015 IEEE 22nd International Conference on High Performance Computing Workshops, IEEE, 2015.
[23] Betts, “Integrated approaches to climate–crop modelling: needs and challenges,” Phil. Trans. R. Soc. B, 2005.
[24] Somvanshi et al., “Modeling and prediction of rainfall using Artificial Neural Network and ARIMA techniques,” J. Ind. Geophys. Union, vol. 10, no. 2, pp. 141–151, 2006.
[25] Nanda et al., “Prediction of rainfall in India using Artificial Neural Network (ANN) models,” Int. J. of Intell. Syst. and Applicat., vol. 5, no. 12, pp. 1–22, 2013.
[26] Ludena, Ahrary et al., “Big data approach in an ICT agriculture project,” in Awareness Science and Technology, 2013 International Joint Conference on, IEEE, 2013, pp. 261–265.
[27] Kumar and Kumar et al., “A rainfall prediction model using artificial neural network,” Control and Syst. Graduate Research Colloq. (ICSGRC), pp. 82–87, 2012.
[28] Mahapatra et al., “A Survey on rainfall prediction using Artificial Neural Network,” Int. J. of Comput. Applicat., vol. 72, no. 16, pp. 32–40, 2013.
[29] Dabhi and Chaudhary, “Hybrid Wavelet-Postfix-GP model for rainfall prediction of Anand region of India,” Advances in Artificial Intell., pp. 1–11, 2014.
[30] Hannak, A. et al., “Tweetin’ in the rain: Exploring societal-scale effects of weather on mood,” in ICWSM
[31] Watson, D., Clark, L. A.