Week 9 – Performance, Efficiency and Optimization
Weekly Topics
• In this week's class we'll look at the following areas:
• Defining performance, efficiency and optimisation,
• Characterising algorithms and their performance using Big-O notation,
• Timing the execution of functions,
• Techniques to increase the performance or lower the resource requirements of our
programs,
• Using threads to 'divide-and-conquer' tasks which are parallelisable,
• Using the right tool for the job when performance is required, and
• Using a profiler to identify the parts of our programs which will benefit most from
optimisation.
What is Performance?
• The performance of software is a measure of how quickly a piece of software
executes, either in general or when performing a given task.
• Typically, we want our software to run as quickly as possible, which means:
• No large delays between receiving an instruction to perform some work
and that work actually starting, and
• No large delays between the work starting and the work completing.
• While we can almost always increase the responsiveness of our applications – in
terms of how long it takes between, say, clicking a button and getting some
kind of feedback that indicates "yes, we noticed the button press – we're
working on it"…
• …sometimes the amount of work that has to be done is so large (for example,
rendering a complex frame of animation in a CAD / rendering package) that we
just have to accept that it will take a long time!
Note: Performance with regard to responding to user interaction is called responsiveness – but this is really a type of performance.
What is Efficiency?
• Efficiency is the act of using as few resources as possible to achieve a
desired outcome.
• The resources we’re talking about can be those of:
• Processing / Time (less / faster is better),
• Memory / RAM (less is better),
• Storage space (less is better),
• Storage access (less is better),
• etc.
• Efficiency is a double-edged sword though (i.e. it has trade-offs) – for example:
• We might be able to make our code use less memory (good!)…but it will
then take longer to process (bad!)
• Or, we might be able to use less storage access (which can be slow) so
that’s good…but then the program will use more memory.
Quiz: Would you prefer an application to require twice the RAM, but then it runs at twice the speed? Or maybe an
application to take twice as long to load, but then require only half the storage space?
What is Optimisation?
• Optimisation is modifying code to streamline it – so that the same amount of
work can be performed in less time than before, or that the same (or very similar)
results can be obtained by performing less work!
• In general we will do this by:
• Re-writing sections of code to function more efficiently,
• Looking at whether we need precise answers or approximate answers
(typically we can get an approximate answer much faster than an exact one!)
• Analysing our code to find where we can either avoid or reuse the results of
previous calculations (see the sketch after this list), and
• Looking to make a favourable trade-off between speed and resource usage
(if one is more important than another).
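• As a tiny sketch of the 'reuse previous results' idea (our own illustration, not from the slides): Python's functools.lru_cache remembers a function's results, so repeated calls with the same arguments skip the calculation entirely:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each distinct n is calculated once; later calls are served from the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(300))  # fast, despite the naive recursive definition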
Systematic Performance Tuning
• The general workflow to performance tune an application is as follows:
- Assess the problem and establish numeric values that categorize
acceptable behaviour (i.e. what is acceptable / desired performance?).
- Measure the performance of the system before modification.
- Identify the part of the system that is critical for improving the
performance. This is called the bottleneck.
- Modify that part of the system to remove or alleviate the bottleneck.
- Measure the performance of the system after modification.
- If the modification makes the performance better, adopt it. If the
modification makes the performance worse, put it back the way it was!
Performance Analysis
• Performance analysis, also called profiling, is the investigation of a program’s
behaviour using information gathered as the program executes.
• The goal of this is to determine which sections of a program to optimise.
• To do this, we can run a profiler on a piece of code. A profiler is a performance
analysis tool that measures the behaviour of a program as it executes,
particularly the frequency and duration of function calls.
• By analysing our code using a profiler we can find:
• Hotspots – which are the pieces of code which run most often, and
• Bottlenecks – which are the pieces of code that run slowly, and thus slow
down the entire program.
• You can think of bottlenecks as small traffic jams on a freeway – only a
small part of the freeway might be jammed up, but it still slows down
everything that must pass through it!
Performance Analysis – Identifying Bottlenecks
• Consider a simple game loop for a typical videogame:
• What are the high-level hotspots for the game?
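• A minimal sketch of such a loop (the function names are illustrative, not from the slide):

import time

def get_input(): pass        # stand-in: read the player's input
def update_world(): pass     # stand-in: game logic, AI, physics
def render_frame(): pass     # stand-in: drawing – usually the biggest cost

for frame in range(3):       # a real game loops until the player quits
    get_input()              # everything in this loop body is a high-level
    update_world()           # hotspot – it runs many times per second...
    render_frame()
    time.sleep(1 / 60)       # ...while start-up/shut-down code runs only once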
Performance Analysis – Identifying Bottlenecks (Cont’d)
• When performance tuning code – it is absolutely vital that you spend your time
and effort optimising the code which will return the most benefit!
• For example, you could optimise the data loader for your application and cut
the time it takes to load in half! Woo-hoo!…
• …but if the data loader originally takes one second to load – you’ve just
wasted (potentially) a large amount of time and effort to shave 500ms off a
process which occurs only once when the application is first loading!
• What if, instead, you spent that time and effort on shaving 2.5ms off the
processing time for something that runs 100 times per second?
• If your optimised code saves 2.5ms each time it runs, then:
• In 5 seconds you've saved 1,250ms (i.e. 1.25 seconds),
• In 5 minutes you've saved 75,000ms (i.e. 1 minute 15 seconds), and
• In 5 hours you've saved 4,500,000ms (i.e. 1 hour and 15 minutes)!
Performance Analysis – Identifying Bottlenecks (Cont’d)
• In computer programming there’s a thing called the 90/10 law that goes as
follows:
90% of your program’s execution time will be spent running
10% of your code.
• If you think about it – this is a very astute observation… Let’s take a video game
as an example. We might take 30 seconds while we:
• Load the menu/sounds/music/textures,
• Initialise input/audio/network/rendering subsystems.
• Then we’re ready to play the game (perhaps for hours!), where we:
• Get input,
• Respond to input & update game world,
• Draw gameworld & play audio
• Then later, we shut it all down – taking only a few seconds.
Performance Analysis – Identifying Bottlenecks (Cont’d)
• If we can use a profiler to identify the 10% or so of our code which is running
90% of the time – then THIS is where we should focus our optimisation effort!
• And this is generally the code which runs multiple times per second, as
opposed to initialisation code, or code that saves or loads data.
• Optimising code which rarely runs is simply not worth your time and effort
unless the delay is absolutely unacceptable – like taking more than a minute or
so to start up an application or save a file etc.
Question: Can you think of any apps or games that take an excessively long time to do something?
Algorithm performance and Big-O Notation
• The more work we give a program to do, the longer it takes to do that work.
• But… what is the relationship between the increased work and increased time
taken to perform that work?
• The answer is that it varies – but we can generally categorise it!
• Some of the more common categories of performance for a piece of code (from
fastest to slowest) are:
• O(1) – constant.
• O(log N) – logarithmic.
• O(N) – linear.
• O(N²) – quadratic.
• O(2ⁿ) – exponential.
• O(N!) – factorial.
• These run from "Super fast! =D" at the top of that list down to "Incredibly slow =(" at the bottom!
Algorithm performance and Big-O Notation (Cont'd)
• If we think of Big-O notation as a measure of the number of operations we need
to perform for a given number of 'things' (i.e. data elements or such), then we
end up with a table like this:

Big-O       Operations for 10 "things"   Operations for 100 "things"
O(1)        1                            1
O(log n)    3                            7
O(n)        10                           100
O(n log n)  30                           700
O(n²)       100                          10,000
O(2ⁿ)       1,024                        2^100 (i.e. approx. 1.27e+30)
O(n!)       3,628,800                    100! (i.e. 9.332622e+157)

• As such, if we can use algorithms which have growth rates that are n² or lower,
then the chances are they'll operate quickly on our data sets – but if we hit O(n!)
we might be in trouble =/
Further reading: http://www.daveperrett.com/articles/2010/12/07/comp-sci-101-big-o-notation/
Algorithm performance and Big-O Notation (Cont’d)
• If we look at this graphically, we can really see how much of a difference the
algorithmic complexity makes – the higher-order curves dwarf the lower-order
ones as the number of elements grows.
Measuring Execution Time in Python
• In the examples on the following slides we’re going to look at code that runs in
a variety of algorithmic complexities – so we’ll start by writing a function that can
measure the execution time of another function. Thankfully for us, Python
makes this pretty easy:
import time

# Function to time a function and return the duration in ms
def timeFunction(some_function):
    start_time = time.perf_counter()
    some_function()
    return (time.perf_counter() - start_time) * 1000
• How good is that? Now we can get the execution time of a function by calling
something like:
duration_ms = timeFunction(lambda: myFunction(some_args))
Note: Lambda functions are a little outside the scope of this course – just consider them a way to quickly and easily say "use this as
the body of a function". Further reading: https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/
Big-O Examples: O(1) – ‘Constant Time’
• O(1) describes an algorithm that will always execute in the same time (or
space) regardless of the size of the input data set.
• For example, consider the following method:
def isFirstElementZero(the_list):
    if (the_list[0] == 0):
        return True
    return False
• In this example, it doesn’t matter if our array has:
• 1 element,
• 1,000 elements, or
• 1,000,000 elements…
• …it’ll still take the exact same amount of time to execute because it’s always
doing the same amount of work (i.e. checking the very first element at index 0).
Adapted from: http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/
Big-O Examples: O(1) – ‘Constant Time’ (Cont’d)
• Let’s test this out – we should get the same result regardless of our list size:
small_list  = [None] * 10000    # Ten thousand
medium_list = [None] * 100000   # One hundred thousand
large_list  = [None] * 1000000  # One million

duration = timeFunction(lambda: isFirstElementZero(small_list))
print("Small duration: {}ms".format(duration))

duration = timeFunction(lambda: isFirstElementZero(medium_list))
print("Medium duration: {}ms".format(duration))

duration = timeFunction(lambda: isFirstElementZero(large_list))
print("Large duration: {}ms".format(duration))
• And our survey says…:
Small duration: 0.0009752375008076187ms
Medium duration: 0.0007314281256056224ms
Large duration: 0.000731428125606056ms
Adapted from: http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/
Big-O Examples: O(N) – Linear Time
• O(N) describes an algorithm whose performance will grow linearly and in direct
proportion to the size of the input data set.
• It's worth noting that when we use Big-O notation we can refer to the best-case,
average-case and worst-case performance scenarios.
• In these examples we're always going to go for the worst-case scenario, where
the full sequence of events in a loop must occur (i.e. we go through the
maximum number of iterations for a given algorithm).
def containsValue(the_list, some_value):
    for i in range(0, len(the_list)):
        if (the_list[i] == some_value):
            return True
    return False
• Let’s time this code (looking for a value that doesn’t exist in this instance) and
use our timeFunction function to see how it gets on…
Big-O Examples: O(N) – Linear Time (Cont’d)
• Giving this a shot works out as follows:
small_list  = [None] * 10000    # Ten thousand
medium_list = [None] * 100000   # One hundred thousand
large_list  = [None] * 1000000  # One million

duration = timeFunction(lambda: containsValue(small_list, 6))
print("Small duration: {}ms".format(duration))

duration = timeFunction(lambda: containsValue(medium_list, 6))
print("Medium duration: {}ms".format(duration))

duration = timeFunction(lambda: containsValue(large_list, 6))
print("Large duration: {}ms".format(duration))
• And the output when running the above is:
Small duration: 0.49249493790784743ms
Medium duration: 5.031006457291302ms
Large duration: 50.59654058877526ms
Why 6? Why not! Anything except None will force a full search.
Big-O Examples: O(N2) – Quadratic Time
• O(N2) represents an algorithm whose performance is directly proportional to the
square of the size of the input data set.
• This is common with algorithms that involve nested iterations over the data set.
• Deeper nested iterations will result in complexities of O(N3), O(N4) and so on.
def containsDuplicates(some_list):
    list_range = range(0, len(some_list))
    for i in list_range:
        for j in list_range:
            if (i != j and some_list[i] == some_list[j]):
                return True
    return False
• Note: The i != j check is there to stop the same value being matched as a
duplicate with itself.
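• As an aside, here's a minimal sketch of one easy optimisation (our own illustration, separate from the timings that follow): start the inner loop at i + 1, which skips both the self-comparison and every pair we've already checked – roughly halving the work, while still being O(N²):

def containsDuplicatesFaster(some_list):
    for i in range(0, len(some_list)):
        for j in range(i + 1, len(some_list)):  # skip self and already-checked pairs
            if (some_list[i] == some_list[j]):
                return True
    return False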
Big-O Examples: O(N2) – Quadratic Time (Cont’d)
• Coding this up and timing it gives the following results:
small_list  = range(100)    # One hundred
medium_list = range(1000)   # One thousand
large_list  = range(10000)  # Ten thousand

duration = timeFunction(lambda: containsDuplicates(small_list))
print("Small duration: {}ms".format(duration))

duration = timeFunction(lambda: containsDuplicates(medium_list))
print("Medium duration: {}ms".format(duration))

duration = timeFunction(lambda: containsDuplicates(large_list))
print("Large duration: {}ms".format(duration))
• After a while we get the results back, which are:
Small duration: 1.74860083894806ms
Medium duration: 215.25734689075998ms
Large duration: 22791.810443086964ms
Note: I had to make the array sizes significantly smaller this time because otherwise the large_list of one million
values was going to take several hours! Ain’t nobody got time fo’ dat!
Big-O Examples: O(N2) – Quadratic Time (Cont’d)
• I wasn't entirely convinced that the times we obtained matched a quadratic
pattern (they don't quite), so I graphed a series of runs using matplotlib – and
the growth does work out roughly quadratically as the list size increases.
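• A minimal sketch of how such a graph can be produced (the list sizes are our own choices, not the ones used for the slide's graph):

import matplotlib.pyplot as plt

sizes = [100, 200, 400, 800, 1600, 3200]
times = [timeFunction(lambda: containsDuplicates(range(n))) for n in sizes]

plt.plot(sizes, times, marker='o')
plt.xlabel('List size (N)')
plt.ylabel('Duration (ms)')
plt.title('containsDuplicates scaling - roughly quadratic')
plt.show()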
Threading
• Back when most computers had only a single CPU, single threaded code was
fine – it would happily work away as fast as it could, and that’s the most we
could expect from it.
• Modern devices, whether PCs or phones, commonly have 2, 4, 8 or more CPU
cores available…
• …so if your code is single threaded, on a quad-core CPU, then you’re using at
most 25% of the possible computational capacity of the device (or even just
12.5% if it has two hardware threads per core!).
• Using multiple threads to perform work requires that we divide the work up to
be shared by multiple CPUs, which means more effort for us (as developers) –
but significantly more performance out of our code.
• There are various threading libraries which can be used to make it easier to
share out work between our CPU cores – which one you use (if any) will depend
on the platform / language you're working with.
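• For instance, a minimal sketch using Python's built-in threading module (the worker function and the four-way split are our own illustration):

import threading

def worker(chunk):
    # Process one chunk of the workload (a stand-in for real work)
    print(sum(chunk))

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]  # divide the work four ways

threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait for all the workers to finish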
Threading (Cont’d)
• Let's take an example of rendering a Mandelbrot fractal in Python. The
Mandelbrot set is generated by iterating a simple equation (z = z² + a) – and if we
visualise it, we end up with an infinitely deep pattern that's self-similar.
Mandelbrot Generation Code – Single Threaded
import time
import matplotlib.pyplot as plt
import numpy as np

# Function to time a function and return the duration in ms
def timeFunction(some_function):
    start_time = time.perf_counter()
    some_function()
    return (time.perf_counter() - start_time) * 1000

def mandelbrot(a):
    z = 0
    for n in range(1, 100):
        z = z**2 + a
        if abs(z) > 2:
            return n
    return np.NaN
Mandelbrot Generation Code – Single Threaded (Cont’d)
def draw_mandelbrot_set():
    X = np.arange(-2, .5, .005)
    Y = np.arange(-1, 1, .005)
    Z = np.zeros((len(Y), len(X)))
    for iy, y in enumerate(Y):
        print(iy, "of", len(Y))
        for ix, x in enumerate(X):
            Z[iy, ix] = mandelbrot(x + 1j * y)
    # Note: plt.cm.gist_ncar is a colour map
    plt.imshow(Z, plt.cm.gist_ncar,
               extent=(X.min(), X.max(), Y.min(), Y.max()))

duration = timeFunction(lambda: draw_mandelbrot_set())
print("Mandelbrot generation: {}ms".format(duration))
plt.show()
Mandelbrot – Single Threaded
• Running this initial single-threaded version on my relatively old work PC (Intel
i7-3770 @ 3.4GHz) gives us around 10.8 seconds to render the Mandelbrot:
Mandelbrot generation duration: 10815.54443433882ms
Mandelbrot – Compiled Python via the Numba Library
• As we know, Python is an interpreted language – which means that each line
has to be converted from Python to machine code every time it runs.
• But… we can use the Numba library to force functions to be “Just In Time”
compiled – which means that the first time they execute they get compiled into
machine code – and then every time we call the function after that we end up
using the previously compiled version instead of having to ‘interpret’ the
Python code into machine code again!
• We can install the Numba library by simply calling pip install numba like this:
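pip install numba
(run from a command prompt or terminal)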
Further reading: https://numba.pydata.org/
Mandelbrot – Compiled Python via the Numba Library (Cont’d)
• To use the Numba library we just import numba and then decorate the functions
that we want to be compiled with the @jit tag:

from numba import jit

@jit
def mandelbrot(a):
    z = 0
    # etc...

@jit
def draw_mandelbrot_set():
    X = np.arange(-2, .5, .002)
    # etc...

• Now, let's run the code with those simple changes made and see how long the
Mandelbrot takes to calculate this time (we're aiming to beat 10 seconds).
Mandelbrot – Compiled Python via the Numba Library (Cont’d)
• This time, by using the Numba library and adding just three lines of code, we’ve
gone from 10.8 seconds to about 4.5 seconds! Result! =D
Mandelbrot generation duration: 4521.03047977428ms
But we can do even better!
Mandelbrot – Python Multi-Threaded Version
• To utilise multiple threads to solve a problem we need to ‘divide and conquer’ it
- which in this case means we need to break up the draw_mandelbrot_set
function into a function which draws a section of it.
• The mandelbrot(a) function stays the same, but we need to refactor the code
which calculates the larger set – so let's:
def draw_mandelbrot_set(xSlice, ySlice, yCounter):
    global Z
    for iy, y in enumerate(ySlice, yCounter):
        print(iy, "of", len(Y))
        for ix, x in enumerate(xSlice, 0):
            Z[iy, ix] = mandelbrot(x + 1j * y)
• Rather than drawing the entire Mandelbrot set, the code above will draw a
horizontal section of it from the far-left to the far-right.
Mandelbrot – Python Multi-Threaded Version
• Now we’ll modify our main code to use threads:
import threading as t  # needed for the t.Thread calls below

if __name__ == '__main__':
    start_time = time.perf_counter()

    X = np.arange(-2, .5, .004)
    Y = np.arange(-1, 1, .004)
    Z = np.zeros((len(Y), len(X)))

    # Specify how many threads we want
    num_threads = 2

    # Get some values ready to divide up the set into slices
    yStep = len(Y) // num_threads
    yCount = 0
    ySlice = list(range(num_threads))
Mandelbrot – Python Multi-Threaded Version (Cont’d)
    threadArray = []  # To hold our array of threads

    # For each thread...
    for i in range(num_threads):
        # Take a horizontal slice of the array
        ySlice[i] = Y[yCount:yCount + yStep]

        # Assign a thread to calculate this section
        thread = t.Thread(target=draw_mandelbrot_set,
                          args=(X, ySlice[i], yCount))

        # Put the thread in our array & start the thread
        threadArray.append(thread)
        thread.start()

        # Get ready for the next 'slice' of the Mandelbrot set
        yCount += yStep
Mandelbrot – Python Multi-Threaded Version (Cont’d)
• Finally (outside the loop-over-threads we were just in) – we can wrap up:

    # Wait for all threads to complete before continuing
    for i in threadArray:
        i.join()

    end_time = time.perf_counter()
    duration = end_time - start_time  # note: this duration is in seconds

    plt.imshow(Z, plt.cm.gist_ncar,
               extent=(X.min(), X.max(), Y.min(), Y.max()))
    print("Mandelbrot calc duration: {}s".format(duration))
    plt.show()

• At this point we can do multiple runs of our code with varying numbers
of threads and see what the performance is like…
Mandelbrot – Python Multi-Threaded Version (Cont’d)
• After all our hard work in refactoring the code to execute in parallel, these are
our results when calculating using between 1 and 10 threads (note the durations
are in seconds)…
Mandelbrot calc time 1 threads: 10.34s
Mandelbrot calc time 2 threads: 10.36s
Mandelbrot calc time 3 threads: 10.27s
Mandelbrot calc time 4 threads: 10.47s
Mandelbrot calc time 5 threads: 10.34s
Mandelbrot calc time 6 threads: 10.21s
Mandelbrot calc time 7 threads: 10.31s
Mandelbrot calc time 8 threads: 10.52s
Mandelbrot calc time 9 threads: 10.43s
Mandelbrot calc time 10 threads: 10.40s
• Python implements a Global Interpreter Lock (GIL) which only allows a single
thread's bytecode to execute at any given time! This enforces thread-safety
(yay!) at the cost of speed (boo!). To make this work quickly using multiple
CPU cores we're going to need parallel processes – so let's get it on!
Mandelbrot – Python Multi-Process Pooled Version
• Pools are a way to allow Python to spin up as many worker processes as it
needs to get a job done.
• The default is to create a worker for each CPU core or hardware thread. As the
machine I'm working on has 4 cores with 2 hardware threads per core, Python
will spin up eight workers to perform our calculations.
import time
import matplotlib.pyplot as plt
import multiprocessing as mp
from functools import partial

# Function to time a function and return the duration in ms
def timeFunction(some_function):
    start_time = time.perf_counter()
    some_function()
    return (time.perf_counter() - start_time) * 1000
Mandelbrot – Python Multi-Process Pooled Version (Cont’d)
def mandelbrotCalcRow(yPos, w, h, max_iterations = 100):
    y0 = yPos * (2/float(h)) - 1  # Rescale to -1 to 1
    row = []
    for xPos in range(w):
        x0 = xPos * (3.0/float(w)) - 2.5  # Rescale to -2.5 to 0.5
        iteration, z = 0, 0 + 0j
        c = complex(x0, y0)
        while abs(z) < 2 and iteration < max_iterations:
            z = z**2 + c
            iteration += 1
        row.append(iteration)
    return row

def mandelbrotCalcSet(width, height, max_iterations = 100):
    # Make a helper function that better supports workers
    partialCalcRow = partial(mandelbrotCalcRow, w=width,
                             h=height, max_iterations=max_iterations)

    # Create a pool of workers
    pool = mp.Pool()
Mandelbrot – Python Multi-Process Pooled Version (Cont’d)
    # Pool.map only accepts one iterable, so use our partial func
    mandelImg = pool.map(partialCalcRow, range(height))

    # Close pool so that it cannot accept any more tasks
    pool.close()

    # Force all workers to complete before continuing
    pool.join()

    # Show the fractal
    plt.imshow(mandelImg, plt.cm.gist_ncar)

if __name__ == '__main__':
    duration = timeFunction(lambda: mandelbrotCalcSet(625, 500, 100))
    print("Mandelbrot generation duration: {}ms".format(duration))
    plt.show()
Mandelbrot – Python Multi-Process Pooled Version (Cont’d)
• At the end of the day, each worker will calculate an eighth of the entire fractal,
delivering us our glorious Mandelbrot plot as before, but in a fraction of the time:
Mandelbrot generation duration: 2614.86ms
Note: The eight top-most Python processes are doing the work (one worker per hardware thread), and the last Python process is
just the command prompt waiting for the workers in the pool to finish!
Mandelbrot – Python Multi-Process Pooled Version (Cont’d)
• We might now think – what if we go back and use the Numba library to @jit-inline
the Pool-worker code?
• Will it go any faster? Yes or no?
Mandelbrot – Python Multi-Process Pooled Version (Cont’d)
• Funnily enough – both yes and no – depending on the circumstances!
• By default we’ve been rendering our Mandelbrot at a resolution of 625×500
pixels…
• …but this is a small number of calculations, so the additional overhead of
compiling the functions using the JIT compiler means that for a small image like
this it actually takes longer than not compiling and inlining them!
• However, if we increase the resolution to something like 1920×1080, then the
additional speed per execution more than compensates for the compilation
overhead so it ends up going faster with the functions inlined!
                  625×500     1920×1080
Not Inlined       2,673ms     11,251ms
Inlined via JIT   3,247ms     5,622ms

• Inlining slows down small workloads due to the overhead of compilation, but
speeds up larger workloads due to the greater execution speed.
Mandelbrot – GPU Version
• At this stage we’ve pretty much hit the limit of how fast Python can generate the
Mandelbrot set. There might be some small code optimisation tweaks, but
nothing too major – we’re simply using all the processing capacity the CPU has
available.
• However, the CPU typically isn't the only processing device on a machine – we
also have a GPU (i.e. graphics processing unit)… and these tend to have not just
4 or 8 cores, but thousands of cores!
Note: CUDA is Nvidia's version of OpenCL – it's a general purpose computation framework that can tap into all the graphics
processing cores (some cards have 3,584 CUDA cores!) to crunch numbers rather than draw graphics. Further reading: https://developer.nvidia.com/cuda-zone
GPU
• Because using the GPU via CUDA or OpenMP or such in Python is a little beyond
the scope of this course – I've put together some C++ code that calculates the
Mandelbrot fractal using the GPU and GLSL (the OpenGL Shading Language).
• Our previous best was around 2 seconds per frame – but running this processing
on the GPU runs it at around 100 frames per second (i.e. approx. 10
milliseconds or 0.01 seconds per frame)! Result! =D
• Video: https://youtu.be/f3fOoB9b9mE
Further reading: OpenMP – http://www.openmp.org/ , CUDA – https://developer.nvidia.com/about-cuda
Profiling
• A profiler is a tool that can be run alongside your code as it executes – and it
times how long each function call takes.
• So if you want your program to run faster, once you know which parts of your
code are the slowest ones you know where you should concentrate your
optimisation effort!
• Let’s take the example of a video game as an application which has very tight
performance requirements.
• If the game is required to run at 60 frames per second (fps), then it only has
one-sixtieth (1/60) of a second to process and render each frame…
• …which is a mere 16.67 milliseconds per frame!
• In that time the game may have to process audio, game logic, artificial
intelligence, networking AND draw the graphics!
Profiling (Cont’d)
• Let's write some code that simulates this kind of processing workload:

import math
import random as rnd

ai_calls = 200; audio_calls = 350; game_logic_calls = 500

def do_ai():
    for loop in range(ai_calls):
        foo = rnd.randint(0, 123456)

def do_audio():
    for loop in range(audio_calls):
        bar = rnd.randint(0, 123456) - 1.2 * 3.4 / 5.6 - 6.7

def do_game_logic():
    for loop in range(game_logic_calls):
        baz = math.sqrt( (rnd.randint(0, 123456) - 1.2 * 3.4 / 5.6 - 6.7)**2 )

Profiling (Cont'd)
• Now that we have some functions that crudely simulate the workload of running
a game, we need to call them. In this example, we'll run for 1000 simulated frames.

def run_game():
    frame_count = 0
    while frame_count < 1000:
        do_ai()
        do_audio()
        do_game_logic()
        frame_count += 1
        print('Frame:', frame_count)

import cProfile
cProfile.run('run_game()')  # Profile the run_game function, please!

Profiling (Cont'd)
• Here's a section of the output from profiling the code:

ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
1        0.000    0.000    5.406    5.406    <string>:1(<module>)
1050000  0.603    0.000    1.273    0.000    random.py:173(randrange)
1050000  0.294    0.000    1.568    0.000    random.py:217(randint)
1050000  0.458    0.000    0.670    0.000    random.py:223(_randbelow)
...
1000     0.090    0.000    0.597    0.001    test.py:13(do_audio)
1000     0.259    0.000    1.070    0.001    test.py:17(do_game_logic)
1        0.002    0.002    5.406    5.406    test.py:21(run_game)
1000     0.039    0.000    0.343    0.000    test.py:9(do_ai)

• ncalls is the number of calls to the function, tottime is the total time spent in a
given function (NOT including calls to sub-functions), and percall is tottime
divided by ncalls.
• cumtime is the cumulative time spent in this and all subfunctions, percall is
cumtime / ncalls, and filename:lineno(function) tells us which function the row
refers to and where it's defined.
Note: Cumulative time is probably the most important metric to us out of all of these.

Profiling (Cont'd)
ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
1000     0.090    0.000    0.597    0.001    test.py:13(do_audio)
1000     0.259    0.000    1.070    0.001    test.py:17(do_game_logic)
1        0.002    0.002    5.406    5.406    test.py:21(run_game)
1000     0.039    0.000    0.343    0.000    test.py:9(do_ai)

• So looking at the above, we can see that our do_ai, do_audio and
do_game_logic functions each ran 1000 times…
• …and do_ai ran for 0.343 seconds, do_audio ran for 0.597 seconds and
do_game_logic ran for 1.07 seconds (cumulative).
• As such, if we wanted things to run faster, then we should probably look into
optimising the do_game_logic function first, as that's the one taking up most of
our processing time!

Profiling (Cont'd)
• So looking at our do_game_logic function again, any idea how we could
optimise it?

def do_game_logic():
    for loop in range(game_logic_calls):
        baz = math.sqrt( (rnd.randint(0, 123456) - 1.2 * 3.4 / 5.6 - 6.7)**2 )
• I’ll give you a few seconds to think about it…
Profiling (Cont’d)
• So looking at our do_game_logic function again, any idea how we could
optimise it?

def do_game_logic():
    for loop in range(game_logic_calls):
        baz = math.sqrt( (rnd.randint(0, 123456) - 1.2 * 3.4 / 5.6 - 6.7)**2 )

• I'll give you a few seconds to think about it…
• Answers:
• As 1.2 * 3.4 / 5.6 - 6.7 is always going to be the same value, we could
pre-calculate the result and use that rather than actually calculating the
value each time!
• Taking the square root of a squared value gives us the magnitude (i.e. the
absolute value) of the original – so we could drop the square-and-square-root
pair entirely and use a simple abs() call to get the exact same result!*
* Remember – the quickest work to perform is the work that you do not have to perform!
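• Putting both answers together, a minimal sketch of the optimised function might look like this (the OFFSET constant and the function name are our own illustration):

import random as rnd

game_logic_calls = 500  # as defined in the example above

# Pre-calculate the constant part once: 1.2 * 3.4 / 5.6 + 6.7 is everything
# subtracted from the random value in the original expression (approx. 7.4286)
OFFSET = 1.2 * 3.4 / 5.6 + 6.7

def do_game_logic_optimised():
    for loop in range(game_logic_calls):
        baz = abs(rnd.randint(0, 123456) - OFFSET)  # abs() replaces sqrt(x**2)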
Performance in General
• The number one priority when designing your applications for performance is to
choose the right algorithms and the right data structures for the job.
• When you have these things right – then performance will be ‘built-in’ to the
application by default!
Top Tip: Don’t know much about algorithms and data structures? Read a book on ’em and you will. That’s how books work =P
Performance in General (Cont’d)
• Once you have your data structures and algorithms – there are two basic rules for
writing efficient code:
• Don’t do work that you don’t need to do.
and,
• Don’t allocate memory if you can avoid it.
• You can not do work really, really fast!
• Look at that! I just did not dig a tunnel to China! And I could not dig five hundred
more just as fast!
• Allocating memory takes time, the memory has to be tracked in
reference-counted / garbage-collected languages like Python (which takes time),
and finally the memory must be freed which, you guessed it, takes time!
Performance in General (Cont’d)
• Object creation is never free. A generational garbage collector with per-thread
allocation pools for temporary objects can make allocation cheaper, but
allocating memory is always more expensive than not allocating memory.
• As you allocate more objects in your app, you will force a periodic garbage
collection, which can create little “hiccups” in the user experience.
• If you can modify a data structure ‘in-place’ (that is, without first having to
make a copy of it) – then chances are it’ll be quicker than if you had created a
copy during the process and then assigned it back to the original structure.
• Sometimes you can pre-calculate values and store them so that rather than
being calculated live they can be looked-up (see the sketch below the note).
This can mean:
• When your application starts up you do the pre-calculation and either store
the results in memory or in a file, or
• When your application needs a value it could read it from a file of
pre-calculated values.
Note: Generational garbage collection is outside the scope of this course, but if you want to know how it all works, try:
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#1.1.%20Generations|outline
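• A minimal sketch of the first approach (our own illustration): pre-calculate a table of sine values at start-up, then look values up instead of calculating them live:

import math

SINE_TABLE = [math.sin(math.radians(deg)) for deg in range(360)]  # built once at start-up

def fast_sin(degrees):
    # A list lookup (plus a modulo) instead of a live math.sin() calculation
    return SINE_TABLE[int(degrees) % 360]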
Performance Tips – Minimise Subclass Nesting
• Don’t nest classes more than a few deep if you can help it.
• Imagine that you had a class that had a subclass…
• …that had a subclass…
• …that had a subclass…
…that had a subclass…
» …that had a subclass…
• …that had a subclass…
• …that had a subclass…
• …that had a subclass…
• …that had a subclass.
• When you're working with methods or data in the lower 'great-great-grandchild'
classes the computer has to do a lot of redirection to step from class to class
to class to finally find the address of the instructions to perform.
• This is slow – don't do it!
Performance Tips – Use Constants Where Possible
• Python doesn’t have constants (i.e. values that cannot be changed) – but most
other languages do.
• If a value is never going to change (e.g. PI = 3.14159) and you can mark it as
constant via const or final or such (depending on what language you're
working in), then you should do so.
• The reason is that if a compiler or interpreter knows a value will never
be changed, it may be able to apply some of its own optimisation logic to
the code which it might otherwise not be able to perform.
• Also, it stops you from accidentally changing the value =D
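• While Python has no const keyword, a rough sketch of the nearest Python conventions (our own illustration): ALL_CAPS naming plus the typing.Final hint, which static checkers such as mypy can enforce:

from typing import Final

# ALL_CAPS signals "treat as a constant"; Final lets type checkers flag reassignment
PI: Final = 3.14159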
Performance Tips – Use Primitives Instead Of Objects
• Use primitives over objects when possible. This isn't really a Python issue as Python is a
dynamically-typed language, but in strongly-typed languages (for example, Java) the available types come
in both primitive and object varieties:
• So primitives would be data types such as:
• int,
• float,
• double,
• boolean etc.
• But we also have their Object equivalents:
• Integer,
• Float,
• Double,
• Boolean etc.
• By using primitives rather than objects our calculations can occur significantly faster because to work
with the object versions the language commonly has to unbox the value (object to primitive), then do
some work, then box the value (primitive back to object).
• Also – using primitives is less likely to lead to garbage collection (which can seriously degrade
performance).
Performance Tips – Use Primitives Instead Of Objects (Cont’d)
• For example in Java, this code (lots of boxing & unboxing required):

Random random = new Random();
Double total = 0.0;
for (Integer loop = 0; loop < 1000000; ++loop)
    total += random.nextDouble();

• Will run 6-7x slower (or worse!) than this equivalent code (no boxing or
unboxing required):

Random random = new Random();
double total = 0.0;
for (int loop = 0; loop < 1000000; ++loop)
    total += random.nextDouble();
Performance Tips – Use Prefix Operators Over Postfix Operators
• Again, this one's not for Python, but it applies to many other languages. Let's
say we have a value like this:

int foo = 5;

• When we use a prefix operator (++foo, --foo etc.) then we change the value
immediately:

System.out.println(++foo); // Prints 6
System.out.println(--foo); // Prints 5

• But when we use a postfix operator (foo++, foo-- etc.), we make a copy of the
original value, increment (or decrement) the original, and then return the copy
– so there's an extra copy involved every time!

System.out.println(foo++); // Prints 5… but after this line the value is 6
System.out.println(foo--); // Prints 6… but after this line the value is 5

• Guess which one of these is quicker? Yup, the prefix operator!
Performance tips – Trade speed for increased resource use
• Let's say we're working on an app which uses particle systems, such as a
fireworks simulation (like rockets that zoom up into the air and explode).
• Every time a firework explodes, we may display 1,000 particles…
• …and for every particle, we may want to have:
• Different colours (per particle – so we need red/green/blue values),
• Different speeds (per particle – so we need x/y/z values).
• For 1,000 particles – we're now up to 6,000 new random values per explosion.
• Let's say we have 100 fireworks, and one explodes on average once every half
second – we're now looking at 12,000 new random values per second!
• Generating random values is typically quite quick… but it's not all that quick.
And it's certainly not free!
Performance tips – Trade speed for increased resource use (Cont’d)
• So – what can we do to decrease the computational requirements of generating
all these random numbers all the time?
• Well, what about if we create a bunch of them early on, and then re-use them?
• For example:
• When we initialise our program, we might create an array of 1,000,000
random numbers (why not – it’s only 4MB if they’re floats),
• And we’ll create a currentRandomNumber int to keep track of which one
we’re using…
• Then, we can just get a random number as required with:
float myRandomNumber = randomNumberList[currentRandomNumber++];
Note: We’d have to wrap the currentRandomNumber value around when it hits our max index – but that’s trivial!
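• A minimal sketch of that wrap-around in Python (the names are our own, not from the slide):

import random

random_number_list = [random.random() for _ in range(1_000_000)]  # generated once, up front
current_index = 0

def next_random():
    global current_index
    value = random_number_list[current_index]
    current_index = (current_index + 1) % len(random_number_list)  # wrap at the end
    return value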
Performance Tips – Use pools
• We kind-of used pools in the last example on random numbers, but to do this
properly we’d create some kind of wrapper-class which operates on the type of
resource we want to re-use.
• Setting up and tearing down things like audio subsystems, database
connections, network connections etc. can take a lot of time.
• A better technique which avoids all this setup and tear down is that we:
- Perform the expensive set-up just once, then
- Re-use the system/connection we’ve set up as required, and finally
- Tear down the system/connection at the end when we’re finished.
• By doing this, we minimise the number of times we need to perform expensive
setup operations…
• …which results in our code having the resources it requires available far
quicker than having to do the setup each time.
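• A minimal sketch of such a wrapper class (our own illustration – a production pool would add error handling, timeouts etc.):

import queue

class ResourcePool:
    def __init__(self, create_resource, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(create_resource())  # expensive set-up, performed once

    def acquire(self):
        return self._pool.get()   # borrow a ready-made resource (blocks if none free)

    def release(self, resource):
        self._pool.put(resource)  # hand it back for re-use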
Wrap up
• That’s it – we’ve covered everything we planned on covering in this course! =D
• Today we’ve talked about the performance of programs and how we can make
them more efficient,
• We’ve looked at big-O notation that describes the performance characteristics
of algorithms (which is where we began back in week 1!),
• We've discussed ways in which we can optimise programs, and how we can make
full use of the processing capacity of our available hardware by using threads or
pools, or even moving work from the CPU to the GPU, and
• We've touched on profiling and how it can help us identify the areas in our
code that we should focus on if we want it to run faster.